The two aspects of reading which have received the most attention 
are speed and comprehension. Various tests have been devised to test 
either the speed or the comprehension factor. The content and 
structure of these tests would seem to indicate that there exists little 
agreement among authors concerning the most adequate method of 
measuring either speed or comprehension. The tendency has been to 
measure speed on the “no comprehension” level, i.e., on very easy 
narrative where comprehension is practically perfect for all individuals 
of the group tested. Rate is in terms of the amount done within a set 
time limit. Comprehension has been measured, for the most part, on 
relatively difficult material. The score usually consists of the number 
of comprehension questions on the text that are answered correctly. 
In general, it seems to have been assumed by the authors as well as 
by the users of the tests that the tests were measuring a general speed 
of reading factor. Tinker² has pointed out, however, that “there are 
many reading skills rather than either a general silent reading ability, 
a general comprehension ability or a general speed of reading ability.” 

When two speed of reading tests consist of strictly comparable 
material and the techniques of measurement are identical, their inter- 
correlation is high. As the content of the second test differs more and 
more from the first or the technique of measurement in the second 
diverges from that in the first, the intercorrelation becomes lower. A 





1 The expense of this study was met by a research grant from the Graduate 
School, University of Minnesota. 

2 Tinker, M. A.: “The relation of speed to comprehension in reading.” School 
and Society, Vol. xxxvi, 1932, pp. 158-160. 
similar trend is found for the relationship between comprehension 
tests. It is not surprising, therefore, when speed “tests” correlate low 
with “comprehension” tests. Tinker¹ has concluded that such inter- 
correlations cannot be expected to reveal a valid relation between 
speed and comprehension. All that they can show is the correlation 
between speed of reading in one situation and comprehension in 
another. 

Speed of reading can have no other meaning than speed of compre- 
hension, since “reading” without comprehension is not reading. In 
practice, therefore, we are interested in the relationship between rate 
of work and degree of understanding in a particular reading situation. 
One wants to know the correlation between rate of work and amount 
of comprehension in reading history, prose narrative, algebra, physics, 
etc. It would seem that the adequate technique for discovering the 
true relation between speed and comprehension in reading is to 
measure rate of work and comprehension on the same or strictly com- 
parable material in each specific reading situation. 

In this study, therefore, speed of reading is defined as rate of 
comprehension in each test used. Comprehension is defined in 
terms of how it is measured in each test. That is, the number of test 
items completed correctly within a set time limit is assumed to be the 
comprehension score. Although some may not agree with the method 
of measuring comprehension in a particular test, that need not con- 
cern us here. 

In a preliminary study, Anderson and Tinker² employed the 
technique outlined above to investigate the correlation between speed 
and comprehension in reading. The test employed proved to be 
fairly easy for university sophomores. The question was raised 
whether a change in the difficulty of the test would bring a change in 
the discovered correlation. The plan of the present study is to 
investigate the relation between speed and comprehension (1) by 
measuring rate of work and degree of comprehension on the same or 
strictly comparable material and (2) by employing as reading material 
tests ranging from very easy (“no difficulty” level) to extremely diffi- 
cult material. Two forms of each test were used so that reliabilities 





1 Tinker, M. A.: “The relation of speed to comprehension in reading.” School 
and Society, Vol. xxxvi, 1932, pp. 158-160. 

2 Anderson, V. L. and Tinker, M. A.: “The speed factor in reading perform- 
ance.” J. Educ. Psychol., Vol. xxvii, 1936, pp. 621-624. 
might be computed on all measures and corrections for attenuation 
made. 

The standardized reading tests employed, listed roughly in order 
from easy to difficult, follow: (1) Forms A and B of the Chapman- 
Cook Speed of Reading Test;¹ (2) Minnesota Speed of Reading Test;² 
(3) Iowa Silent Reading Test³ (Part 1, Paragraph Meaning); (4) Ohio 
State University Psychological Test⁴ (Part 5, Reading Comprehension 
Test); (5) Minnesota Reading Examination;² and (6) Reading Scales in 
Educational Psychology.¹ 

The subjects for the experiment were university students, prac- 
tically all from the sophomore class. 

In all reading situations the following procedure was adopted: 
With the exception of the Chapman-Cook examination (group testing), 
all tests were given individually with a standard time limit and also 
timed for the rate of working. The standard time limits were set 
empirically so that the fastest student in each group tested could 
almost but not quite complete each test. The subjects were directed 
to work rapidly and consistently, but not to sacrifice accuracy for 
speed. Each reader was allowed to work on a test until the experi- 
mentally determined standard time had elapsed; at which point, he 
was interrupted and a line drawn across the page below the last item 
attempted. Instructions were then given to complete the test, and 
when the last item was finished the total time required for the whole 
examination was recorded. This technique was considered to yield a 
fairly adequate rate of work score for members of the group as a 
whole. It will be shown later that this assumption is apparently 
justified. 

Three scores were derived from the data: (1) Number of items done 
correctly in standard time; (2) number of items attempted in standard 
time; (3) total time taken to complete the test. The number of 
items done correctly in unlimited time tended to yield a very restricted 
range of scores and a distribution that deviated markedly from the 
normal. This score, therefore, was not employed in the statistical 
comparisons. 
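
The derivation of the three scores can be sketched in code. The block below is an illustration only: the `ItemRecord` structure, its field names, and the sample timings are hypothetical (the study recorded a line drawn on the test page and a single total time, not per-item times), and an item is treated as attempted in standard time if it was finished by the limit.

```python
from dataclasses import dataclass


@dataclass
class ItemRecord:
    finished_at_min: float  # hypothetical: minutes elapsed when the item was finished
    correct: bool           # whether the response was scored right


def derive_scores(items, standard_time_min):
    """Return (attempted, right, total_time) for one reader."""
    if not items:
        raise ValueError("at least one item record is required")
    # Items that fall above the line drawn when the standard time elapsed.
    in_time = [it for it in items if it.finished_at_min <= standard_time_min]
    attempted = len(in_time)                               # score (2)
    right = sum(1 for it in in_time if it.correct)         # score (1)
    total_time = max(it.finished_at_min for it in items)   # score (3)
    return attempted, right, total_time


# Hypothetical reader: four items, standard time of 3.5 minutes.
reader = [
    ItemRecord(1.0, True),
    ItemRecord(2.0, False),
    ItemRecord(3.0, True),
    ItemRecord(4.5, True),
]
scores = derive_scores(reader, standard_time_min=3.5)
```

Here the hypothetical reader attempts three items and gets two right within the limit, and needs 4.5 minutes for the whole test.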

The results will be presented and discussed in the order from easy 
reading material to the more difficult. With the Chapman-Cook Test, 





1 Published by Educational Test Bureau, Minneapolis. 

2 Published by University of Minnesota Press, Minneapolis. 
3 Published by World Book Co., Yonkers-on-Hudson. 
the procedure was slightly different from that outlined above. The 
comprehension score was measured on one form and the speed score 
on an equivalent form. The means¹ and standard deviations for the 
three groups measured are presented in Table I. Three comparable 
groups were measured. It was necessary to test Groups II and III 
to furnish data for the reliability computations. The means show 
that the number right and the number attempted in standard time 
(one and three-quarters minutes) are almost identical (Group I). As 
indicated above, this test is on the “no difficulty” level. University 
students seldom do an item incorrectly. The reliability coefficients 
for each type of score are derived from Groups II and III. These 


TABLE I.—MEANS AND SD’s FOR CHAPMAN-COOK SPEED OF READING TEST

Measure                                      |  M   |  σ
Group I. N = 270 University students
Number attempted in standard time: Form A    | 21.0 | 5.0
Number right in standard time: Form A        | 20.9 | 5.0
Total time in minutes: Form B                |  3.0 | 0.9
Group II. N = 163 University students
Total time in minutes: Form A                |  2.6 | 0.7
Total time in minutes: Form B                |  2.8 | 0.8
Group III. N = 183 University students
Number attempted in standard time: Form A    | 19.0 | 4.3
Number attempted in standard time: Form B    | 18.5 | 3.7
Number right in standard time: Form A        | 18.9 | 4.3
Number right in standard time: Form B        | 18.3 | 3.7

coefficients, from correlating scores on two equivalent forms, are 
listed below: 
I. Number attempted in standard time, r = .81. 
II. Number right in standard time, r = .80. 
III. Total time, r = .85. 
All coefficients are consistently high, and the rate of work scores are 
slightly more reliable than the others. 
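
An equivalent-forms reliability of this kind is a product-moment correlation between the scores the same readers earn on Form A and on Form B. The sketch below shows the computation; the function name and the five pairs of scores are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary on both forms")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical "number attempted" scores for five readers on each form.
form_a = [19, 21, 17, 23, 20]
form_b = [18, 22, 16, 21, 20]
reliability = pearson_r(form_a, form_b)
```

A coefficient near 1 means that readers keep nearly the same relative standing from one form to the other.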





1 Part of these data are taken from another investigation conducted by D. G. 
Paterson and M. A. Tinker. 
The intercorrelations of speed and accuracy scores are keyed as 
indicated above (I, II, III). The correlations and the corrections 
for attenuation follow: 

First group: 
I, Form A, vs. III, Form B, r = -.83; attenuation r = -1.00. 
II, Form A, vs. III, Form B, r = -.83; attenuation r = -1.00. 
I, Form A, vs. II, Form A, r = .997. 
The number attempted in standard time is as good a rate measure as 
the time taken for the whole test (I vs. III). This is also uniformly 
true for all succeeding tests in the series. On this test (Chapman- 
Cook), speed and comprehension scores yield practically identical 
standings (II vs. III and I vs. II). 

In the next reading situation, the Minnesota Speed of Reading 
Test constructed for use with university students, the text is con- 
siderably more difficult than in the Chapman-Cook, which was 
designed for elementary-school children. The means and standard 
deviations are listed in Table II (standard time = five minutes). 


Although high, the accuracy in this test is not as great as in the 
Chapman-Cook. 


TABLE II.—MEANS AND SD’s FOR MINNESOTA SPEED OF READING TEST
N = 143 University Sophomores

Measure                               |   Form A    |   Form B
                                      |  M   |  σ   |  M   |  σ
Number attempted in standard time     | 27.1 | 5.2  | 25.4 | 5.0
Number right in standard time         | 23.9 | 4.8  | 25.2 | 5.3
Total time in minutes                 |  8.3 | 1.6  |  7.5 | 1.6

The reliability coefficients (Form A vs. Form B) follow: 
I. Number attempted in standard time, r = .77. 
II. Number right in standard time, r = .69. 
III. Total time, r = .81. 
Again the consistency of response is higher for the rate of work scores. 
With the scores keyed as indicated in the reliability outline (I, 
II, III), the intercorrelations of speed and accuracy scores follow: 
I vs. III, Form A, r = -.91. 
II vs. III, Form A, r = -.81. 
I vs. II, Form A, r = .90. 
I vs. III, Form B, r = -.92. 
II vs. III, Form B, r = -.84. 
I vs. II, Form B, r = .91. 
Speed correlates very high with comprehension. The results from the 
two forms are very similar. 

For those who would like to see the size of these relationships when 
speed and comprehension are not measured upon identical material, 
the coefficients given below were computed (Method II). The speed 
score is taken from Form A and the comprehension score from Form B 
or vice versa. It should be noted that, with the exception of the 
Chapman-Cook Test, the textual materials in the two forms of the 
tests used in this study are not strictly comparable. The correlations 
and the coefficients corrected for attenuation in Method II follow: 


I, Form A vs. III, Form B, r = -.71; attenuation r = -.90. 
I, Form B vs. III, Form A, r = -.80; attenuation r = -1.01. 
II, Form A vs. III, Form B, r = -.67; attenuation r = -.90. 
II, Form B vs. III, Form A, r = -.69; attenuation r = -.92. 

I, Form A vs. II, Form B, r = .66; attenuation r = .91. 
I, Form B vs. II, Form A, r = .72; attenuation r = .98. 

The coefficients are somewhat lower here than when speed and 
comprehension are measured on identical material. Nevertheless, when 
corrected for attenuation, the relation between speed and comprehen- 
sion is almost unity. The writer believes that the former method 
(identical material) is more adequate, especially since one has no 
satisfactory method for equating comparability of texts in the two 
forms of a test. However, the trend of the intercorrelations is similar 
in Method I and Method II, and the inferences drawn are justified by 
either set of data. 

It will be noted that the intercorrelations in Method I are not 
corrected for attenuation while those in Method II are corrected. 
The validity of the use of the correction formula rests upon the 
assumption that errors of measurement are uncorrelated with each 
other and with the paired scores. Failure of the data to meet this 
assumption may result in absurd values of r, that is, yield correlations 
greater than 1.00. In Method I this assumption is not justified for 
each set of paired scores is derived from the same material; that is, 
certain errors affecting score in standard time will also affect number 
attempted in standard time, and the total time score. Such errors 
enter both scores (X₁ and X₂) in any pair and are positively correlated.¹ 





1 The writer gratefully acknowledges suggestions from Professor Jack W. 
Dunlap concerning the statistical treatment employed in this study. 
Therefore, the correction for attenuation is employed only in 
Method II. 
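
The correction itself can be illustrated briefly. The article does not print the formula; the sketch below assumes the standard Spearman form, in which the observed correlation is divided by the square root of the product of the two reliabilities. The function name is hypothetical. The inputs are the Minnesota Speed of Reading coefficients listed earlier (I vs. III across forms, with reliabilities .77 and .81).

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Assumed Spearman correction: r_xy / sqrt(r_xx * r_yy)."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# I, Form A vs. III, Form B: observed r = -.71.
corrected_a = correct_for_attenuation(-0.71, 0.77, 0.81)

# I, Form B vs. III, Form A: observed r = -.80.
# The corrected value falls slightly beyond -1, the kind of "absurd"
# value that can arise when the assumptions of the formula are not met.
corrected_b = correct_for_attenuation(-0.80, 0.77, 0.81)
```

Under this assumption the first value is about -.90 and the second slightly exceeds unity in absolute size, in line with the corrected coefficients reported for Method II.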

Part I, Paragraph Meaning of the Iowa Silent Reading Test was 
employed as the third reading situation. This examination, designed 
for high schools and colleges, is of medium difficulty. The means 
and standard deviations of the scores are listed in Table III. About 


TABLE III.—MEANS AND SD’s FOR IOWA SILENT READING TEST, PART I
N = 100 University Sophomores

Measure                               |   Form A    |   Form B
                                      |  M   |  σ   |  M   |  σ
Number attempted in standard time     | 19.4 | 4.5  | 21.0 | 4.3
Number right in standard time         | 17.6 | 4.5  | 18.1 | 3.8
Total time in minutes                 | 10.8 | 2.2  | 10.8 | 2.0

eighty-five to ninety per cent of the items are done correctly in stand- 
ard time (eight minutes). The reliabilities of the measures (Form A vs. 
Form B) follow: 
I. Number attempted in standard time, r = .65. 
II. Number right in standard time, r = .60. 

III. Total time, r = .72. 

These coefficients are not as high as the split-half reliabilities cited 
in the published manual. Note that the rate of work scores are con- 
siderably more consistent than the number of items correct in stand- 
ard time. 

The intercorrelations are given below, keyed according to the 
reliability outline (I, II, III). The coefficients by Method I (rate 
and accuracy scores on identical material) follow: 

I vs. III, Form A, r = -.84. 
II vs. III, Form A, r = -.79. 
I vs. II, Form A, r = .93. 
I vs. III, Form B, r = -.82. 
II vs. III, Form B, r = -.72. 
I vs. II, Form B, r = .92. 
The trends are identical for the two forms of the test. The number 
attempted in standard time (I) is a good measure of rate of work on 
the whole test (III). There is an intimate relation between the speed 
and comprehension scores. 
The intercorrelations by Method II (rate and accuracy from dif- 
ferent forms of test) are listed below. The coefficients are somewhat 
smaller but the trends are similar to those discovered by Method I. 


I, Form A vs. III, Form B, r = -.61; attenuation r = -.89. 
I, Form B vs. III, Form A, r = -.73; attenuation r = -1.06. 
II, Form A vs. III, Form B, r = -.51; attenuation r = -.77. 
II, Form B vs. III, Form A, r = -.71; attenuation r = -1.09. 

I, Form A vs. II, Form B, r = .65; attenuation r = 1.04. 
I, Form B vs. II, Form A, r = .60; attenuation r = .97. 
The differences between results here and in Method I are probably 
due to the possibility that textual material is not strictly comparable 
in the two forms. This also may explain the relatively low reliability 
coefficients as here computed. 

The fourth test in our series is the Reading Comprehension Test, 
Part 5, of the Ohio State University Psychological Test. This reading 
material is distinctly more difficult than that in the preceding test. 
The carefully selected paragraphs, however, are of descriptive and 
scientific material which require no special background to understand. 
(This is not true for the next two tests.) The means and standard 
deviations are listed in Table IV (standard time = thirty-six minutes). 


TABLE IV.—MEANS AND SD’s FOR READING COMPREHENSION TEST (PART 5)
OF OHIO STATE UNIVERSITY PSYCHOLOGICAL TEST
N = 69 University Sophomores

Measure                               |    Form 9    |   Form 12
                                      |  M   |  σ    |  M   |  σ
Number attempted in standard time     | 81.9 | 11.8  | 86.4 | 12.3
Number right in standard time         | 62.5 | 11.0  | 69.4 | 10.8
Total time in minutes                 | 56.1 | 11.9  | 45.3 |  9.3

The difficulty of this test is revealed by the fact that only seventy-five 
to eighty per cent of the number attempted in standard time are 
correct. The reliabilities (Form 9 vs. Form 12) of the measures 
follow: 
I. Number attempted in standard time, r = .70. 
II. Number right in standard time, r = .76. 
III. Total time, r = .68. 
For the first time in our series of tests, the coefficients for the rate of 
work scores are no higher than for the accuracy score. 

The intercorrelations by Method I, keyed as indicated above 
(I, II, III) are given below: 


I vs. III, Form 9, r = -.93. 
II vs. III, Form 9, r = -.68. 
I vs. II, Form 9, r = .69. 
I vs. III, Form 12, r = -.92. 
II vs. III, Form 12, r = -.77. 
I vs. II, Form 12, r = .84. 
These coefficients reveal the equivalence of the rate scores (I and III) 
and that speed is intimately associated with comprehension (I vs. II 
and II vs. III). 

The comparable correlations by Method II (time and comprehen- 
sion on different forms) follow: 

I, Form 9 vs. III, Form 12, r = -.68; attenuation r = -.99. 
I, Form 12 vs. III, Form 9, r = -.69; attenuation r = -1.00. 
II, Form 9 vs. III, Form 12, r = -.53; attenuation r = -.74. 
II, Form 12 vs. III, Form 9, r = -.58; attenuation r = -.81. 

I, Form 9 vs. II, Form 12, r = .57; attenuation r = .78. 
I, Form 12 vs. II, Form 9, r = .68; attenuation r = .93. 

The next situation, the Minnesota Reading Examination, Part II, 
is considerably different in both difficulty and subject-matter from 
previous tests in the series. The means and standard deviations of 
the scores are given in Table V (standard time = twenty-five minutes). 
Difficulty of the test is shown by only sixty to seventy per cent 
correct responses on items attempted in standard time. 


TABLE V.—MEANS AND SD’s FOR MINNESOTA READING EXAMINATION
N = 77 University Sophomores

Measure                               |   Form A    |   Form B
                                      |  M   |  σ   |  M   |  σ
Number attempted in standard time     | 28.7 | 3.9  | 29.2 | 4.3
Number right in standard time         | 17.6 | 3.8  | 19.4 | 4.0
Total time in minutes                 | 30.7 | 5.4  | 30.7 | 5.4

Reliabilities of the scores (Form A vs. Form B) are given below: 
I. Number attempted in standard time, r = .73. 
II. Number right in standard time, r = .62. 
III. Total time, r = .76. 
The reliabilities of the time scores (I and III) are considerably higher 
than for the accuracy score (II). 
The intercorrelations of time and accuracy scores by Method I 
follow: 
I vs. III, Form A, r = -.90. 
II vs. III, Form A, r = -.38. 
I vs. II, Form A, r = .42. 
I vs. III, Form B, r = -.88. 
II vs. III, Form B, r = -.57. 
I vs. II, Form B, r = .66. 
These coefficients show equivalence for the two rate scores (I and 
III). The correlation of either speed score with the comprehension 
score is not high. The degree of relationship, however, is considerably 
higher for Form B than for Form A. Perhaps adaptation to the test 
material is affecting the results on Form B. We shall point out in the 
discussion that relatively low correlations on results from this test 
are probably due to at least two factors: (1) Difficulty of reading 
material, and (2) type of textual material. A similar condition is 
found in results on the next test used. 
The intercorrelations by Method II follow. The trend is the 
same as by Method I. 


I, Form A vs. III, Form B, r = -.67; attenuation r = -.90. 
I, Form B vs. III, Form A, r = -.76; attenuation r = -1.02. 
II, Form A vs. III, Form B, r = -.35; attenuation r = -.50. 
II, Form B vs. III, Form A, r = -.45; attenuation r = -.66. 

I, Form A vs. II, Form B, r = .40; attenuation r = .59. 
I, Form B vs. II, Form A, r = .46; attenuation r = .69. 

The final test of the series, Reading Scales in Educational Psychol- 
ogy, is in some respects similar to the test just described; that is, it is 


TABLE VI.—MEANS AND SD’s FOR READING SCALES IN EDUCATIONAL PSYCHOLOGY
N = 100 University Students

Measure                               |   Form A    |   Form B
                                      |  M   |  σ   |  M   |  σ
C-score in standard time              | 95.5 | 6.1  | 95.1 | 5.9
Total time in minutes                 | 33.1 | 6.0  | 32.3 | 5.4

relatively difficult and demands specialized knowledge for satisfactory 
comprehension of the text. The means and standard deviations of 
the scores are given in Table VI (standard time = twenty-five 
minutes). The specialized method of scoring this test prevents 
using the score “Number attempted in standard time” in our 
computations. 

The reliabilities (Form A vs. Form B) are listed below: 

II. C-score in standard time, r = .66. 

III. Total time, r = .80. 
The consistency of response is much greater in the rate of work score. 

The intercorrelations, keyed as indicated above, follow: 

Method I: II vs. III, Form A, r = -.53. 
          II vs. III, Form B, r = -.48. 
Method II: II, Form A vs. III, Form B, r = -.46; attenuation r = -.64. 
           II, Form B vs. III, Form A, r = -.37; attenuation r = -.51. 


These coefficients are low when derived by Method I, and even lower 
by Method II. Obviously, rate of work is not intimately associated 
with comprehension in this test. 


DISCUSSION 


The reliabilities in this study, computed by correlating scores on 
two forms of the tests, are not high with the exception of the Chap- 
man-Cook Test. This may be due to two factors: (1) Lack of consist- 
ent performance and (2) lack of comparability of the two test forms, 
especially in type of textual material. It has been established that 
some of these tests are internally highly consistent (split-half correla- 
tion). With material like the contents of the Chapman-Cook Test, 
it is easier to approximate comparable text from form to form than in 
the other tests, especially the more difficult ones. It is well known 
that variation in textual material from one reading test to another 
lowers the intercorrelation. 

It is interesting to note that in all but one instance the rate of 
work scores yield higher reliability coefficients than the comprehension 
scores. The average correlation of .77 for total time indicates that 
it is measuring a highly consistent form of response. 

The difference between the mean scores for number attempted and 
for number correct in standard time reveals the relative difficulty of the 
tests used. These scores range from no difference in the Chapman- 
Cook to a marked difference in the Minnesota Reading Examination, 
where the accuracy of response was only sixty to seventy per cent. 
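
The accuracy percentages quoted for the harder tests can be approximated from the table means. The sketch below divides the mean number right by the mean number attempted in standard time, using the values in Tables III to V. This ratio of means is an assumption about how the percentages were obtained, and the names used are hypothetical.

```python
# (mean attempted, mean right) in standard time, copied from Tables III-V.
TABLE_MEANS = {
    "Iowa Silent Reading, Form A": (19.4, 17.6),
    "Iowa Silent Reading, Form B": (21.0, 18.1),
    "Ohio State Reading Comprehension, Form 9": (81.9, 62.5),
    "Ohio State Reading Comprehension, Form 12": (86.4, 69.4),
    "Minnesota Reading Examination, Form A": (28.7, 17.6),
    "Minnesota Reading Examination, Form B": (29.2, 19.4),
}


def accuracy_ratio(mean_attempted, mean_right):
    """Share of attempted items answered correctly, from group means."""
    if mean_attempted <= 0:
        raise ValueError("mean attempted must be positive")
    return mean_right / mean_attempted


ratios = {name: accuracy_ratio(a, r) for name, (a, r) in TABLE_MEANS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.0f} per cent")
```

The ratios fall from roughly nine-tenths on the Iowa test to between six- and seven-tenths on the Minnesota Reading Examination, which is the ordering of difficulty described in the text.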

In this study the criterion of relationship between speed (rate of 
work) and comprehension is the correlation between (1) total time to 
complete the test and comprehension scores in standard time or (2) 
number of items attempted in standard time and comprehension 
scores in standard time. As noted above, the method of measuring 
understanding in each test was accepted as measuring comprehension 
in that test. Inspection of the intercorrelations reveals consistently 
high correlations between number attempted in standard time and 
total time of work on the whole test, which is a pure rate measure. 
The mean uncorrected coefficient for all tests was —.88 (Method I). 
This indicates, of course, that the two scores are practically identical 
and may be interchanged. The number of items attempted within a 
time limit which permits the fastest worker in the group to almost but 
not quite finish the test appears to be an entirely satisfactory measure 
of rate of work for reading in an understanding manner the type of 
material in each of the tests employed. Furthermore, this seems to 
justify our earlier contention that the time taken for the whole test 
should be representative of the reader’s rate of work during standard 
time as defined in this study. 
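
The mean coefficient between the two rate scores can be checked against the listed values. The sketch below assumes a simple arithmetic mean over the nine I vs. III coefficients reported for the first five tests (the article does not state how the mean was formed); the variable names are hypothetical.

```python
from statistics import mean

# I vs. III coefficients as listed for each test (one per form where given).
I_VS_III = {
    "Chapman-Cook": [-0.83],
    "Minnesota Speed of Reading": [-0.91, -0.92],
    "Iowa Silent Reading": [-0.84, -0.82],
    "Ohio State Reading Comprehension": [-0.93, -0.92],
    "Minnesota Reading Examination": [-0.90, -0.88],
}

all_coefficients = [r for values in I_VS_III.values() for r in values]
mean_rate_correlation = mean(all_coefficients)
```

Under that assumption the mean agrees with the reported figure of -.88 to two decimals.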

We have stated that both rate of work and comprehension should 
be measured on identical or strictly comparable material if we are to 
make a valid comparison of the relationship between the two. The 
only test in our series with two forms that we know are strictly com- 
parable is the Chapman-Cook Test. In the other tests, therefore, the 
intercorrelations between rate and comprehension have been done with 
identical material (Method I). These correlations have been sup- 
plemented by similar computations in which the speed measure is 
taken from one form of the test and the comprehension scores from 
another form (Method II). Although the uncorrected correlations 
are lower by Method II (corrected correlations in II are much like 
the uncorrected correlations in I), the trends are similar and the 
inferences to be derived are the same. 

Examination of the series of correlations between rate of work 
and degree of comprehension reveals a definite trend. With very 
easy material the correlations are very high. As the tests become 
more difficult the size of the coefficients becomes smaller. From easy 
to difficult tests the mean coefficients in Method I are, respectively, 
.93, .87, .84, .73, .51, and .48. In Method II the corresponding 
uncorrected coefficients are: .83, .69, .62, .59, .42, .42; corrected for 
attenuation: 1.00, .93, .97, .82, .61, .58. This trend seems to indicate a 
definite relationship between the size of the coefficient and the difficulty 
level of the reading material. As the material becomes harder, the 
correlation is lowered. 
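
The Method II means can be recomputed from the coefficients listed for each test. The sketch below assumes that each mean is the arithmetic mean of the absolute values of the speed versus comprehension coefficients (II vs. III and I vs. II) for that test. That rule is an inference rather than something the article states, and the names used are hypothetical.

```python
from statistics import mean

# Per test: (uncorrected |r| values, corrected |r| values), Method II,
# in order from the easiest to the most difficult test.
METHOD_II = [
    ("Chapman-Cook", [0.83, 0.83], [1.00, 1.00]),
    ("Minnesota Speed of Reading", [0.67, 0.69, 0.66, 0.72], [0.90, 0.92, 0.91, 0.98]),
    ("Iowa Silent Reading", [0.51, 0.71, 0.65, 0.60], [0.77, 1.09, 1.04, 0.97]),
    ("Ohio State Reading Comprehension", [0.53, 0.58, 0.57, 0.68], [0.74, 0.81, 0.78, 0.93]),
    ("Minnesota Reading Examination", [0.35, 0.45, 0.40, 0.46], [0.50, 0.66, 0.59, 0.69]),
    ("Reading Scales in Educational Psychology", [0.46, 0.37], [0.64, 0.51]),
]

# Means as reported in the text, in the same order.
REPORTED_UNCORRECTED = [0.83, 0.69, 0.62, 0.59, 0.42, 0.42]
REPORTED_CORRECTED = [1.00, 0.93, 0.97, 0.82, 0.61, 0.58]

computed_uncorrected = [mean(raw) for _, raw, _ in METHOD_II]
computed_corrected = [mean(cor) for _, _, cor in METHOD_II]

for (name, _, _), raw, cor in zip(METHOD_II, computed_uncorrected, computed_corrected):
    print(f"{name}: uncorrected {raw:.3f}, corrected {cor:.3f}")
```

Each computed mean lies within rounding distance of the corresponding reported value, and the downward drift from the easy to the difficult tests is visible in both columns.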

There seem to be at least two factors determining the size of these 
correlations between speed and comprehension. First, there is the 
variation in difficulty of the textual material. The degree of accuracy 
declines consistently from test to test in the series with a rather sharp 
break between the fourth (Ohio State Reading Comprehension) and 
the fifth (Minnesota Reading Examination). The reason for the 
decrease in size of coefficient with increase in difficulty of the text may 
be surmised. It seems probable that, as the material gets harder, the 
consistency of rate of work tends to fluctuate by varying degrees from 
reader to reader. The amount of this fluctuation is undoubtedly 
determined by long standing habits of work which may vary con- 
siderably from person to person. 

A second factor which is probably involved in lowering the speed 
vs. comprehension correlations in the difficult material is the kind of 
text being read. We have noted that a decided drop in the coefficients 
came only with the last two tests. In the first four examinations no 
special background was essential to an understanding of the material 
involved. The other two tests, however, did require considerable 
specialized information for adequate assimilation of the material. 
This point is best illustrated by quoting a paragraph¹ from the fifth 
test, the Minnesota Reading Examination: 


The final issue to which the Faustian wisdom tends—though it is only in 
its highest moments that it has seen it—is the dissolution of all knowledge 
into a vast system of morphological relationships. Dynamics and Analysis 
are in respect of meaning, form, language, and substance, identical with 
Romanesque ornament, Gothic cathedrals, Christian-German dogma and 
the dynastic state, one and the same world-feeling speaks in all of them. They 
were born with, and they aged with the Faustian culture, and they present 
that culture in the world of day and space as a historical drama. The uniting 
of the several scientific aspects into one will bear all the marks of the great 
art of counterpoint. An infinitesimal music of the boundless world space— 
that is the deep unresting longing of this soul, as the orderly statuesque and 
Euclidean Cosmos was the satisfaction of the classical. That—formulated 
by a logical necessity of Faustian reason as a dynamic-imperative causality, 





1 Paragraph X from Form A, Minnesota Reading Examination, Part II. 
Permission to quote granted by University of Minnesota Press. 
then developed into a dictatorial, hardworking, world transforming science—is 
the grand legacy of the Faustian soul to the souls of culture yet to be, a bequest 
of immensely transcendent forms that the heirs will possibly ignore. And 
then, weary after its striving, the western science returns to its spiritual home. 


Although there is a wide selection of subject-matter in the various 
paragraphs of the test, the abstruseness of much of the material is 
illustrated by the above quotation. The methods of response and of 
scoring in the last two tests of our series are quite different from those 
in the four easier tests. This probably added to the difficulty of 
obtaining correct scores. 


SUMMARY AND CONCLUSIONS 


1. Rate of work and comprehension scores were obtained on six 
reading tests ranging from the ‘‘no difficulty” level to very difficult. 

2. The number of items attempted within a standard time limit 
that allowed the fastest workers to almost but not quite complete all 
items of a test was found to correlate consistently high with the time in 
minutes taken to complete the whole test. Either score, therefore, 
can be used as a rate of work measure. 

3. The correlation between rate of work and comprehension is 
very high for easy material but decreases steadily as the difficulty of 
the tests increases. This lowering of the degree of correlation is not 
appreciable, however, until the reading material becomes very difficult, 
as in tests 5 and 6 of our series. 

4. There is a suggestion that type of material including kind of 
response required in the reading test also affects the correlation 
between speed and comprehension. Material which requires a special 
background of training for its interpretation appears to lower the 
correlation. 

5. The data warrant the conclusion that there is an intimate rela- 
tionship between speed and comprehension in reading when the 
textual material is within the reader’s educational experience. 
THE EFFECT OF “RIGHT” AND “WRONG” UPON 
THE LEARNING OF NONSENSE SYLLABLES 
IN MULTIPLE CHOICE ARRANGEMENT 


J. W. TILTON* 
Yale University 


I. INTRODUCTION 


In Thorndike’s Fundamentals of Learning?! are reported, in detail, 
the experiments which he has elsewhere described as follows”? p. 34: 


The general plan of all the experiments is the same. In every case the 
person may do any one of several things, of which one is right and the rest 
wrong. If he does the right thing, he is then and there rewarded. If he does 
the wrong thing, he is then and there punished. For illustration here I will 
take cases where there are five acts possible. Call them R (the right one) and 
X1, X2, X3, and X4 (the four wrong ones). For example, I show the person 
a German word followed by five English words, one of which is the correct 
translation, the other four being wrong. He chooses one of the five, and is 
rewarded if it is right and punished if it is wrong. We do the same for one 
hundred ninety-nine other German words, and then repeat the two hundred, 
and so on until many or all are learned. 

The extraordinary result is that in such cases the punishments do no good 
whatever. Punishing X1, or X2, or X3, or X4 does not make it less likely to 
occur. The person improves only because of the rewards for R. If the
person does R and is rewarded he is more likely to do R the next time. But if
he does X1 and is punished, he is not less likely to do X1 the next time. The 
wrong tendencies are not reduced in strength one jot or tittle by the punish- 
ment. If the person gets rid of them, it is simply and solely because the R 
tendency becomes so strong that it displaces X1, X2, X3, and X4. All the 
improvement in these experiments is due to the rewards. 


Experiments in which the intensity of the punishment or of both 
reward and punishment were varied have been interpreted by Lorge,’ 
Tuckman”’ and Rock" as confirming Thorndike’s conclusion as to the 





* The writer is indebted to the students who acted as subjects; to President 
F. E. Engleman, New Haven State Teachers College, for his coöperation; to the
Yale University Committee on Bursary Appointments for the assistance of a 
bursary student, Ward Tibbetts; to the National Youth Administration for the 
assistance of Abraham Knepler, a graduate student; to Chairman C. M. Hill, Yale 
Department of Education, for the time of a research assistant, Mary T. Brammel; 
and to these three assistants—especially to Mrs. Brammel who carried the brunt
of the load—for generous and capable help. 
nature of the influence of punishment. Lorge wrote (p. 204) that
“consistently, the finding has been that a wrong response does more
harm by occurring than the punishment can offset.” Tuckman concluded
(p. 41) that “no intensity of punishment prevented the occurrences
of the punished connections from being harmful to learning.”
According to Rock (p. 77), “Punished wrong responses were repeated
in a larger percentage of cases than would be expected by mere chance,
and this was found to hold even for the responses given the greatest
punishments.”

There is a tendency on the part of educators to accept the conclusion
uncritically because they believe with Thorndike “that the value
of punishment has been much exaggerated in both theory and practice.”
This tendency is reflected in reviews of Thorndike’s work,
written by McAndrew!! and Jones.’ But from the point of view of 
experimental psychology, questions have been raised concerning 
Thorndike’s work. Among these, the following two seem to have been 
given most attention: 

(1) Was “right” a reward and “wrong” a punishment? Tolman,
Hall, and Bretnall,*° considered it more reasonable to interpret them 
as “emphasizers.” “Our finding is that an emphasis upon success is
more helpful than an emphasis on lack of success. The emphasis 
upon the latter would seem to act, perhaps, as a baleful fascination 
which attracts the performer and which makes the avoiding of the 
response in question the more difficult.” Thus, although they questioned
Thorndike’s interpretation, they considered that their “finding
is really in line with that obtained by Thorndike, when he discovered 
that it aided learning more to tell a subject when he was right than to 
tell him when he was wrong.” The Tolman, Hall, Bretnall experi- 
ments have been criticized by Goodenough? and by Muenzinger!? and 
accepted as verified by Hulin and Katz* and by Silleck and Lapha."” 
However one may interpret this experimentation, the possibility, if 
not the probability, of the operation of an emphasis factor must be 
admitted. This leads to a consideration of the other question. 

(2) In the Thorndike experiments first referred to, zero was defined 
as what he thought would have happened if his subjects had not been 
told when right and wrong. In order to determine the effect of 
“right” and “wrong” apart from their possible influence as emphasizers,
should he not have chosen as his zero the effect of a non-informative
announcement in the medium used in announcing “right” and
“wrong”? Lorge and Thorndike have themselves considered this
question. In two experiments!® the subjects’ responses were followed
by “right” plus a click, “wrong” plus a click, or a click alone. Compared
with the effect of the click alone as a zero, the effect of “wrong”
was opposite to that of “right,” but the authors speak of this as “the
lesser strengthening by a ‘wrong’ than by the ambiguous ‘click
alone.’” By assuming that the click alone permitted the subject to
believe that his response was right, they accounted for their facts
without attributing a weakening influence to “wrong.” Lorge,
Eisenson, and Epstein® added two more experiments to this series and 
made the same interpretation of their data. Stephens!’ performed a 
series of four experiments, accepted his point of reference as a satisfactory
zero, and concluded that “when measured from the base line of
‘informationless something happening’ punishment and reward seem
to have distinctly opposite (if not equal) effects.” In a refinement of
these experiments by Stephens and Baer'® the influence of “wrong”
was not statistically significant. 

For two reasons, the writer’s experiments are not directed to the 
questions stated above, but to the possibility of a misinterpretation of 
Thorndike’s 1932 data from other considerations. The reasons are 
these. First, the degree of certainty or uncertainty attached to 
Thorndike’s 1932 conclusions will play an important part in the experi- 
mental effort to evaluate the influence of each of the factors—punish- 
ment, emphasis, information, etc. This is illustrated in the different 
conclusions drawn from similar data by Stephens on the one hand, and 
by Lorge and Thorndike, and Lorge, Eisenson, and Epstein on the 
other. Thorndike and his collaborators might just as well have argued 
that the click alone allowed the subject to think that his response was 
wrong and that, but for this fact, “wrong” would have shown a weakening
influence equal to the strengthening by “right.” If this occurred
to them it was dismissed as not in accord with their interpretation of 
their previous work. The second reason for turning attention to the 
original experiments is that, to the extent that experimentation follows 
theoretical interest, there will remain in the literature of educational 
psychology, insufficiently criticized and insufficiently qualified, 
Thorndike’s finding that telling his subjects when they were wrong did 
more harm than good. We turn, then, from questions of punish- 
ment, emphasis, and information to two questions concerning the 
effect of “right” and “wrong” which, although of less theoretical
interest, are not without theoretical significance, and certainly are of 
practical importance. 
II. QUESTIONS TO BE INVESTIGATED


Question One.—In Thorndike’s measurement of the effect of 
“right” and of “wrong,” is there a likelihood of the zero or base line
having been in error? Was the assumption of chance repetition justi- 
fied as a measure of what would have happened had no announcement 
been made? If his zero was in error, was the error of such a nature as 
to have negated a contribution of “wrong” to learning? Thorndike
justified his use of a chance base²¹ (pp. 278-284) as follows: He assumed
that responses for which the tendency was greater than chance, if there 
were any, would most likely be among the responses on trial one; they 
would next most likely be among the responses made for the first time 
on trial two, and so on. He omitted from consideration the responses
made on trial one, and then estimated the need for further correction 
by an implied assumption that “right” and “wrong” would have an
equal effect upon unequally strong response tendencies. On the basis 
of the first assumption and with the aid of the second, he examined 
some of his data for evidence of greater strength among the responses 
appearing in a given trial for the first time, than among those appear- 
ing in the next trial for the first time. Finding no significant differ- 
ence, he concluded that it was safe to assume, apart from the learning 
he was studying, that the repetition of the responses he studied was a 
matter of chance. Hull has expressed himself‘ as not convinced by 
Thorndike’s assurances and cites the group experiment of Stephens" 
in which thirty-six per cent of second-choice responses were repeated, 
although the subjects were not told in these cases whether they were 
right or wrong. It is true that Thorndike used not only second, but 
third and fourth choices as well. The third-choice responses in his 
experiments might have had chance strength, second-choice responses 
a greater strength, and fourth choices less, the whole yielding a close 
approximation to chance. His answer on this point, based on the 
assumptions stated above, may not be accepted in lieu of an experi- 
mental determination. 

Lorge and Thorndike give the impression of having determined 
their base line experimentally’ (p. 375) by studying the favoritism of 
individual subjects. “Each subject’s record is inspected to detect
any reversals of favoritism in consecutive trials with the same series, 
any systematic shifts of favoritism, and any clear cases of short-lived 
habits of favoritism.” They report, “There are no such reversals,
shifts, or short-lived habits of importance. In general, the amount of 
favoritism is such that the probability that, apart from occurrences,
rewards, and punishments, the same response would be repeated in 
two consecutive trials is not much above the .250 which would result if 
favoritism were zero.” 

It appears from the context that this inspection of favoritism did 
not afford any information regarding favoritism for particular response 
words. If this was the case, their conclusion should have been more 
carefully qualified. It is obvious that a very strong favoritism for 
particular words, which would in itself produce far more than a chance 
amount of repetition, would not be revealed as favoritism for position, 
if the favorites were distributed equally among the possible positions. 
Such a strong favoritism was shown by Stephens’ subjects’ and again 
in the experiments of Stephens and Baer.'® 
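
The point can be made concrete with a small calculation. The sketch below is illustrative only: the choice probabilities (0.4 for a favored syllable, 0.2 for each of the others) are hypothetical, and successive choices on a line are assumed to be independent. Under those assumptions, favoritism for particular syllables raises the expected repetition above .250 while leaving every position equally favored.

```python
def repetition_probability(choice_probs):
    """Chance that two independent choices on the same line coincide."""
    return sum(p * p for p in choice_probs)


def position_marginals(lines):
    """Average probability of choosing each position, taken over all lines."""
    n_positions = len(lines[0])
    return [sum(line[k] for line in lines) / len(lines) for k in range(n_positions)]


# Hypothetical page: on every line one syllable is favored (0.4) and the
# other three are equally likely (0.2 each). The favored syllable rotates
# evenly through the four positions from line to line.
lines = [[0.4 if k == i % 4 else 0.2 for k in range(4)] for i in range(8)]

no_favoritism = repetition_probability([0.25] * 4)
with_favoritism = sum(repetition_probability(line) for line in lines) / len(lines)
marginals = position_marginals(lines)

print(no_favoritism, with_favoritism, marginals)
```

An inspection limited to positional favoritism would see four equal marginals here and so would not detect the excess repetition.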

Question Two.—Did the multiple-choice situation, in which there 
were four wrongs but only one right, prejudice the case in favor of 
“right”? Might a situation in which four responses were right and 
only one wrong have yielded different results? In reviewing Thorndike’s
findings, Peterson'*® wrote, “These results are, of course, to be
expected, since only one response could be correct, and four of the five 
possible ones could be wrong, so that each call of Right definitely 
locates and emphasizes the correct response, while the call of Wrong 
does neither of these, but may even produce a confusion of impulses.” 
Lorge’ introduced a modification of Thorndike’s procedure to take 
care of this situation. He arranged four possible responses and called 
two right and two wrong. The results, as already stated, were similar
to those of the earlier studies. These experiments show that the 
failure of “wrong” to weaken was not wholly due to there being only
one right response among five. They do not eliminate that situation 
as a contributing factor, for the experiments were, in other ways, 
different from the earlier ones. However, a closer comparison fol- 
lowed. Lorge set up two additional experiments,’ experiments five 
and six, like experiment two of the group just referred to, except that 
in experiment five there were three rights and one wrong, and in 
experiment six there were one right and three wrongs. In experiment 
two there were two rights and two wrongs. Lorge concluded that 
“the higher the initial chance of right, the higher the value of the 
reward,” but refrained from drawing a conclusion as to the relation
between the chances of wrong responses and the effect of “wrong.”
Errors are not reported, but there is a convincing consistency among
the measures of the effect of “right.” There is a suggestion in the
data that the more wrong responses there were the more “wrong”
acted like a weakener. Merely because of the ambiguity of the results 
there is need for more information. But, quite aside from the question 
of reliability, Lorge’s experiments two, five, and six cannot fully 
answer our Question Two because, as pointed out above, our Question 
One has not been satisfactorily answered. The error in Lorge’s base 
line may have varied directly with the number of possible rights as 
specified in the directions, and hence may have accounted for his conclusion.
Furthermore, the study of the effect of “right” and “wrong”
should not be complicated by the possible disruptive influence of the
electric shock which accompanied “wrong” in the Lorge experiments.


III. PLAN OF EXPERIMENTS 


Materials.—Nonsense syllables in multiple-choice arrangement,
four responses per stimulus, were used throughout. Nonsense sylla- 
bles were used because fifteen hundred elements were needed; and 
because it was desirable to have the four responses on each line equal 
in association value. More important considerations were: The desire 
to investigate the effect of “right” and “wrong” where learning
proceeds slowly; and the desire to investigate the assumption of 
chance repetition by the use of materials with a minimum of associ- 
ative favoritism. 

The syllables were taken from the Glaze list (1).* Six forms were 
made, each consisting of fifty lines, a stimulus syllable and four 
response syllables per line. The method was as follows: Beginning at 
the zero-association end of the Glaze list the first syllable was made 
the first stimulus syllable of Form I; the second syllable was made the 
first stimulus syllable of Form II; the seventh syllable became the first 
response syllable on line one of Form I; the eighth syllable became the 
first response syllable on line one of Form II; the thirty-first syllable 
was used as the fourth response syllable on line two of Form VI; 
the thirty-second syllable was used as the fourth response syllable on 
line two of Form V, etc. This method produced similar forms, 





* Krueger’s list® is superior in that the association values were determined on a 
much larger group, but the method and rate of exposure used by Glaze were much 
more like the method and time in our experiments. For a sample of two hundred 
fifty syllables, the Pearson product moment r between Glaze association values and 
Krueger association values is +.87. On the strength of this correlation one may 
claim that there is little basis of choice between the lists, but he may not claim that 
the Glaze values are highly unreliable. That there is a valid relation between 
Glaze association values and rate of learning has been shown by Stroud. 
CHART I.—PLAN OF EXPERIMENTS

Experiment  Forms used  Trials  Procedure
----------  ----------  ------  ---------------------------------------------
I           I           1-6     Syllables are 1R3W—Subjects told when R or W
I           II          1-6     Syllables are 1W3R—Subjects told when R or W

                                R group, N = 20      W group, N = 20
II          III         1-6     1R3W—Not told        1W3R—Not told
II          IV          1-6     1R3W—Told            1W3R—Told

III         V           1       1R3W—Told            1W3R—Told
III         V           2       1R3W—Not told        1W3R—Not told
III         V           3       1R3W—Told            1W3R—Told
III         V           4       1R3W—Not told        1W3R—Not told
III         V           5       1R3W—Told            1W3R—Told
III         VI          1       1R3W—Not told        1W3R—Not told
III         VI          2       1R3W—Told            1W3R—Told
III         VI          3       1R3W—Told            1W3R—Told
III         VI          4       1R3W—Told            1W3R—Told
III         VI          5       1R3W—Told            1W3R—Told

                                Group I, N = 20      Group II, N = 20
IV          I           1       2R2W—Not told        2R2W—Not told
IV          I           2       2R2W—Told            2R2W—Not told
IV          I           3       2R2W—Told            2R2W—Told
IV          II          1       2R2W—Not told        2R2W—Not told
IV          II          2       2R2W—Not told        2R2W—Told
IV          II          3       2R2W—Told            2R2W—Told
IV          III₁        1       2R2W—Not told        2R2W—Not told
IV          III₂        2       2R2W—Told            2R2W—Not told
IV          III₃        3       2R2W—Told            2R2W—Told
IV          IV₁         1       2R2W—Not told        2R2W—Not told
IV          IV₂         2       2R2W—Not told        2R2W—Told
IV          IV₃         3       2R2W—Told            2R2W—Told

approximate equality on a line, broke up the alphabetical pattern in
the Glaze list, and gave forms which might be conveniently divided 
into two sections, lines one to twenty-five, and lines twenty-six to 
fifty, of unequal association value. 
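
The dealing procedure described above can be sketched as follows. This is one reading of the rule, not a record of the author's worksheet: the text fixes only the placement of syllables 1, 2, 7, 8, 31, and 32, and the sketch assumes that every even-numbered line runs in full reverse order (fourth response first, Form VI first). The names `deal_forms`, `FORMS`, `LINES`, and `SLOTS` are introduced here for illustration.

```python
FORMS = 6   # Forms I to VI
LINES = 50  # lines per form
SLOTS = 5   # slot 0 is the stimulus, slots 1-4 are the responses


def deal_forms(syllables):
    """Deal an ordered list of syllables into forms[form][line][slot].

    Assumed rule: odd-numbered lines are filled slot by slot across
    Forms I..VI; even-numbered lines are filled in full reverse order.
    """
    if len(syllables) != FORMS * LINES * SLOTS:
        raise ValueError("expected exactly 1500 syllables")
    forms = [[[None] * SLOTS for _ in range(LINES)] for _ in range(FORMS)]
    supply = iter(syllables)
    for line in range(LINES):
        forward = line % 2 == 0  # line index 0 is "line one"
        slot_order = range(SLOTS) if forward else reversed(range(SLOTS))
        for slot in slot_order:
            form_order = range(FORMS) if forward else reversed(range(FORMS))
            for form in form_order:
                forms[form][line][slot] = next(supply)
    return forms


# Stand-in for the Glaze list: ranks 1..1500, lowest association value first.
forms = deal_forms(list(range(1, 1501)))
print(forms[0][0])  # Form I, line one: stimulus and four responses
```

Because the supply is ordered by association value, each line draws its five elements from neighboring ranks in every form, and lines one to twenty-five necessarily come from the lower half of the list.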

For use in Experiment IV, Forms III and IV were prepared in two 
additional arrangements. First the fifty lines were shuffled, then the 
four response syllables of each line were shuffled. One shuffling 
served for both Forms III and IV, so that Forms III and IV remained 
similar in their second arrangements and in their third arrangements. 
The original arrangements of Forms III and IV will be referred to as
III₁ and IV₁, the first shufflings as III₂ and IV₂, and the second
shufflings as III₃ and IV₃.
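
A minimal sketch of such a shared shuffling follows. The original shuffling was presumably done by hand; the seeded generator, the function names `draw_shuffling` and `rearrange`, and the placeholder forms are assumptions made for illustration. The essential feature is that one drawn ordering is applied to both forms.

```python
import random


def draw_shuffling(n_lines, n_responses, seed):
    """Draw one line order and one response order per line."""
    rng = random.Random(seed)
    line_order = list(range(n_lines))
    rng.shuffle(line_order)
    response_orders = []
    for _ in range(n_lines):
        order = list(range(n_responses))
        rng.shuffle(order)
        response_orders.append(order)
    return line_order, response_orders


def rearrange(form, line_order, response_orders):
    """Apply a drawn shuffling to a form given as [stimulus, r1, r2, r3, r4] lines."""
    new_form = []
    for new_pos, old_line in enumerate(line_order):
        stimulus, *responses = form[old_line]
        order = response_orders[new_pos]
        new_form.append([stimulus] + [responses[k] for k in order])
    return new_form


# Placeholder forms: each element records (line, slot) so moves can be traced.
form_iii = [[("III", line, slot) for slot in range(5)] for line in range(50)]
form_iv = [[("IV", line, slot) for slot in range(5)] for line in range(50)]

line_order, response_orders = draw_shuffling(50, 4, seed=1)
iii_2 = rearrange(form_iii, line_order, response_orders)
iv_2 = rearrange(form_iv, line_order, response_orders)
```

Drawing a second shuffling with a different seed would give the third arrangements in the same way.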

Purposes.—The use made of the materials is indicated in Chart I. 
The purposes of the experiments were: 


Experiment I 


1. To obtain a basis from Forms I and II combined for equating groups R 
and W for experiments II and III, 

2. To set up an experiment (using Form I) similar to Thorndike’s
experiments,²¹ and

3. To make the comparison between Form I and Form II made by Lorge 
between his experiments six and five;® 


Experiment II 


1. To determine, with Form III, the amount of repetition when subjects 
are not told when right or wrong, (a) when trying to repeat rights (R group) 
and (b) when trying to avoid wrongs (W group), 

2. By the use of Form IV, to measure from these base lines or zero points 
the effect of ‘‘right’”’ and ‘‘wrong,”’ by subtracting repetition in Form III from 
corresponding repetition in Form IV; 


Experiment III 


1. To vary experiment II so as to make other types of subtraction possible,
e.g., not only Form VI, Trial two, minus Form V, Trial two, but V T1 − VI T1
and V T3 − V T2;


Experiment IV 


1. To find to what extent repetition, when the subjects are not told when 
right or wrong, is repetition of position, and to what extent it is repetition of 
syllables apart from position, 

2. To determine the bearing of the repetition of position upon the measures 
of the effect of “right” and “wrong” afforded by Experiments II and III. 
Subjects.—In Experiment I, ten graduate students in the Yale
Department of Education, and thirty first-year students in the New 
Haven State Teachers College acted as subjects. Five of the ten and 
fifteen of the thirty constituted the R group for Experiments II and 
III. The other twenty acted as the W group. The forty subjects 
of Experiment IV did not participate in Experiments I to III. They
were all first-year students in the New Haven State Teachers College. 

Experimental Technique.—The experimenter spelled the stimulus 
syllable; the subject spelled and underlined a response syllable on that 
line. The experimenter said “right” or “wrong” (if giving that
information was in order) and immediately directed attention to the 
next line by spelling the stimulus syllable for that line. Five or six 
trials were run per sitting. The time per line averaged about six 
seconds. This is about as rapidly as the materials may be exposed if
learning is to take place. Variations in rate were minimized by the 
pressure of a full schedule fitted into fixed college periods. One 
experimenter did all the work with the graduate students, and another 
did all the work at the Teachers College. 


IV. EXPERIMENT I 


On chance repetition as a base, the results with Form I were similar
to the results of the Thorndike experiments. The potency of “right”
as a strengthener was +6.3% ± 1.6%.* “Wrong” failed as a
weakener by −3.2% ± .9%. The corresponding figures for Form II
are +7.4 ± 1.4 and −4.3 ± 2.2. As in Lorge’s experiments, “right”
was more potent in the 1W3R situation than with the 1R3W form.
However, the difference, 1.1, is insignificant, its SE being 2.1. The
difference for “wrong” is likewise insignificant, but in direction agrees
with the tendency suggested by Lorge’s data.
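
The quoted difference and its standard error can be checked with a short calculation. The sketch assumes the usual formula for the standard error of a difference between two measures treated as independent; the text does not state which formula was used, but this one reproduces the reported 2.1.

```python
import math


def se_of_difference(se_a, se_b):
    """Standard error of a difference, treating the two measures as independent."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


# Potency of "right": Form II (1W3R) minus Form I (1R3W).
right_difference = 7.4 - 6.3
right_se = se_of_difference(1.4, 1.6)

# Potency of "wrong": Form II minus Form I.
wrong_difference = -4.3 - (-3.2)
wrong_se = se_of_difference(2.2, 0.9)

print(round(right_difference, 1), round(right_se, 1))
print(round(wrong_difference, 1), round(wrong_se, 1))
```

In both cases the difference is smaller than its standard error, which is the sense in which the text calls it insignificant.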

A combination of Forms I and II afforded a measure of learning 
from trial one to trial six, with a reliability coefficient of .62. With 
these scores an R group was matched with a W group so as to divide,
as equally as possible, the graduate group and the undergraduate 
groups working on Tuesday, Thursday and Friday. The difference 
in mean learning scores for the R and W groups is 3.9 ± 11.0. The
difference in standard deviations is 4.8 ± 8.7. As a further check
upon the equivalence of the R and W groups, the data of Experiment I 
were reworked for the two groups. For the R group the results are 





* All errors reported in this article are standard errors. 
“right” +7.0, “wrong” −3.6; for the W group “right” +6.7,
“wrong” −4.0. The differences are .3 and .4 with standard errors
of 2.1 and 2.3. 


V. EXPERIMENTS II AND III 


Further Comment on Procedure.—After running Experiment I, a 
change was made in the procedure planned for Experiments II and 
III. Inspection of Experiment I records raised the question whether 
or not a subject, making very little sense out of the task of choosing 
one nonsense syllable to associate with another, isn’t very likely to try 
to discover a key pattern; or, for that matter, whether or not without 
such effort he is not likely to pattern his way down the page. A 
pattern on a multiple-choice page is easily remembered as any one 
knows who has scored multiple-choice tests. The implications for 
Experiment II were important. A subject might pattern his way 
through Form III, undisturbed, repeating patterns from trial to trial. 
Told when right or wrong with Form IV he might modify or drop his 
patterning and, as a consequence, “wrong” would be credited with
having brought about many more changes of response than it had 
specifically effected. This question should have been answered experimentally
(as was later planned in Experiment IV), but at the time the
writer thought only of trying to reduce patterning to a minimum. It 
was decided not only to attempt this with the directions, but by 
reading to the R group the syllables they were to underline, and to the 
W group the syllables they were to avoid underlining. The effort 
was to create a mental set favorable to attention to individual syllables. 
The amount of possible memorization was minimized by scattering 
the fifty pertinent pairs of syllables among fifty others, avoiding 
altogether the first and last ten positions on the reading list. As a 
result of the reading, the initial average scores of number right were 
14.5 instead of the expected 12.5 for the R group, and 38.1 instead of 
the expected 37.5 for the W group. 

The instructions used for the W group with Form III follow: 


INSTRUCTIONS 


Third Day—Form III 
Group W


Today’s experiment is quite different from the first and second. Although 
the materials are similar, the purpose today is to study retention and 
forgetting. 
I shall begin by reading to you one hundred pairs of syllables, fifty of which
will be the fifty stimulus syllables of Form III followed in each case by the 
syllable in that line which you are to avoid. I shall read the list but once. 
After I have read it, we shall go through Form III six times, as on previous 
occasions, except that I shall not say whether you are right or wrong. On each 
line one syllable, the one I shall have read, will be wrong and any of the other 
three will be right. You are to spell a syllable aloud and underline it, but you 
are to avoid the one I shall have read to you. 

I want to find out whether your score increases or decreases from trial 
to trial. On first thought, one might say that your score would go down 
because of forgetting, but it is quite possible that, as you get used to the 
syllables, some of your vaguer memories which have not functioned in the 
earlier trials will tell you in the later trials which syllable not to spell and 
underline. 

To find out whether this is the case, look at each of the syllables, in order 
to give each faint impression a chance to exert its influence. The syllables 
are not on the paper in the order in which I shall read them, so don’t try to 
depend upon position. In fact, don’t try to use any system. Any such 
attempt would only detract from the functioning of the faint impressions 
which are being studied. 

On each line you are to avoid spelling aloud and underlining the syllable 
you think I shall have read to you. 

(To the experimenter: After reading the list, have the subject turn over
the paper and then repeat the last sentence of the above instructions before
starting.)


It will be seen from these instructions that the W group procedure 
was sharply contrasted with the customary emphasis upon the repeti- 
tion of a right response. Practical considerations dictate the experi- 
mental evaluation of this contrast. In many situations, children are 
told not to do a certain thing. “You may do this, or do something
else, but don’t do that.” “Go almost anywhere else, but don’t go
there.” It is recognized, of course, that in this matter the procedure
of Lorge® is not being exactly duplicated, but it seems to the writer 
more profitable to evaluate “right” and “wrong” in the procedure
used here. 

Results.—The facts with regard to repetition when the subjects 
were not told when right or wrong are given in Tables I and II. 
There is a suggestion in Table I that the more meaningful syllables 
were repeated more frequently than the less. There is also a sugges- 
tion that the R group, looking for a right, repeated more than did the 
W group, trying to avoid a wrong. But clearly, the difference between
the R and W group procedures had a very slight effect upon the amount 
of repetition. The results for the two groups are combined for greater
reliability in Table II. The “stronger” responses made in trials one
and two were repeated just about as frequently as the “weaker”
responses were repeated in Stephens’ experiment.'® The responses 
made for the first time in trial five were repeated more often than 
would have been expected by chance, but not reliably more. The 
probabilities are that first position responses were repeated least 
frequently, for position one was the least favored position. On trial 
one with Form III the distribution of the two thousand underlinings 
among positions one to four was 19.6, 27.6, 30.0 and 22.9 per cent. 
On trial six it was 19.5, 28.3, 31.2 and 21.1 per cent. The shifts in 
favoritism are probably responsible for the differences in the per- 
centages of repetition given in Table I for the four positions. But the 
point to be made here is that even in the least favored position, 


TABLE I.—PER CENT REPETITION ON FORMS III, V AND VI
(Average of All Measures Used as a Base)

[label illegible] ......................................  40.5 ± 2.5*
On items 26-50 .........................................  [illegible]
[label illegible] ......................................  42.3 ± 2.5
[row illegible]

* All errors reported in this article are standard errors.

TABLE II.—PER CENT REPETITION ON FORM III
(Subjects Not Told When Right or Wrong)

Of responses made in trial 1 ...........................  35.4 ± 1.7
Of responses made in trial 2, but not before ...........  36.8 ± 1.5
Of responses made in trial 3, but not before ...........  33.4 ± 2.3
Of responses made in trial 4, but not before ...........  31.1 ± 2.5
Of responses made in trial 5, but not before ...........  28.1 ± 3.6
Of responses made in trial 2, 3, and 4, but not before .  33.8 ± 1.2
Of first position responses made in trials 2, 3, and 4,
  but not before .......................................  31.6 ± 2.0
Of second position responses made in trials 2, 3, and 4,
  but not before .......................................  34.2 ± 1.8
Of third position responses made in trials 2, 3, and 4,
  but not before .......................................  [illegible]
Of fourth position responses made in trials 2, 3, and 4,
  but not before .......................................  32.9 ± 1.9


responses were certainly repeated more often than would have been 
predicted by chance. This fact may not be attributed to the reading 
of the syllables to be chosen by the R group. To increase reliability, 
these syllables were included in the determination of the percentages 
of Table I, but without them, the percentages of repetition for positions
one to four are 31.4, 33.8, 36.2, and 33.8. To have assumed
chance repetition of the responses made in trials two, three, or four,
but not before, would certainly have introduced an error. On the
basis of that assumption, the potencies of “right” and “wrong” in
the Form IV data would have been reported as +11.6 and −1.6.
On an experimentally determined base they are +2.7 and +7.0. 
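
The effect of the choice of base can be followed in a short sketch. The two functions state the subtraction as it is described in the text. The input percentages are not reported figures: they are back-computed here from the four potencies just quoted and the chance values of 25 and 75 per cent, and they stand in for what the article actually obtains by averaging three separate trial measures.

```python
CHANCE_REPETITION = 25.0  # per cent; one response in four
CHANCE_CHANGE = 75.0      # per cent; three responses in four


def potency_of_right(told_repetition, base_repetition):
    """Excess repetition of rewarded responses over the base."""
    return told_repetition - base_repetition


def potency_of_wrong(told_change, base_change):
    """Excess change of punished responses over the base."""
    return told_change - base_change


# Implied (not reported) Form IV percentages when subjects were told.
told_repetition_of_rights = 36.6
told_change_of_wrongs = 73.4

# Implied (not reported) Form III percentages when subjects were not told.
base_repetition_of_rights = 33.9
base_change_of_wrongs = 66.4

on_chance_base = (
    potency_of_right(told_repetition_of_rights, CHANCE_REPETITION),
    potency_of_wrong(told_change_of_wrongs, CHANCE_CHANGE),
)
on_experimental_base = (
    potency_of_right(told_repetition_of_rights, base_repetition_of_rights),
    potency_of_wrong(told_change_of_wrongs, base_change_of_wrongs),
)
print(on_chance_base, on_experimental_base)
```

With the data held fixed, moving the base from chance to the untold condition shrinks the credit given to “right” and turns the apparent failure of “wrong” into a positive effect.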

In Table III are listed ten different evaluations of the potencies of
“right” and “wrong” measured from experimentally determined
zeros. The measures are independent with one exception. Form V,
trial two, was used as a base for two computations. The measures are
grouped, because of similarity, into four groups. The same data are
regrouped in several ways in Table IV. None of the measures of the
potency of “wrong” are minus. The average measure is significantly
above zero. So far as chance errors are concerned, it may be
said, safely, that “wrong” exercised the weakening influence traditionally
attributed to it. In fact, “wrong” is more potent than
“right” by nine of the ten methods (Table III) and in nine of the
twelve ways in which the data are divided in Table IV. The difference
is 4.0 ± 1.3.
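
The averages and the difference quoted above can be recomputed from the ten pairs of measures in Table III. The sketch assumes that each standard error is the standard deviation of the ten measures (divisor N) over the square root of N, and that the two averages are treated as independent; the article does not state its formulas, but these assumptions reproduce the printed figures.

```python
import math
import statistics

# The ten measures of Table III, in the order listed there.
right = [3.5, 1.0, 3.5, 5.4, 3.8, 3.8, 2.8, 5.8, 8.9, 10.6]
wrong = [9.1, 3.8, 8.2, 3.2, 11.1, 10.3, 9.3, 9.9, 13.3, 10.9]


def mean_and_se(values):
    """Mean and standard error, using the divisor-N standard deviation."""
    se = statistics.pstdev(values) / math.sqrt(len(values))
    return statistics.mean(values), se


mean_right, se_right = mean_and_se(right)
mean_wrong, se_wrong = mean_and_se(wrong)

difference = mean_wrong - mean_right
se_difference = math.sqrt(se_right ** 2 + se_wrong ** 2)
wrong_exceeds_right = sum(w > r for r, w in zip(right, wrong))

print(round(mean_right, 2), round(se_right, 2))
print(round(mean_wrong, 2), round(se_wrong, 2))
print(round(difference, 1), round(se_difference, 1), wrong_exceeds_right)
```

The difference is about three times its standard error.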

There are, in the last row of Table IV, slight differences of the sort 
noted in Experiment I, and by Lorge.’ “Right” seems to be more
potent in the situation in which the subject hears “right” three times
as often, and “wrong” seems to be more effective in the situation in
which the subject hears “wrong” more often. However, in Experiments
II and III, as in Experiment I, the differences between the
R and W group procedures had relatively slight effect upon the evaluation
of “right” and “wrong.”

The method of computation throughout was to count cases for
groups of five subjects on twenty-five items in all four positions; to 
express these counts of repetition and of change as percentages; and 
then to average the percentages. This was done so as to make Table 
III possible. With it the reader is free to weight the ten approaches 
as he sees fit. What difference would it have made if another method 
of assembling the data had been followed? The question has been 
answered for Experiment II by counting cases, for the whole R group 
and W group separately, for each position, converting into percentages 
and averaging. This means, in general, combining data where before 
percentages were averaged, and averaging percentages where before 
numerators and denominators were combined. The results are shown 
in Table V, method number two. The potencies are both higher than 
for the same data by the method used in the preparation of Tables
III and IV (method number one). 
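
Why the two ways of assembling the data can disagree is easily shown. The counts below are hypothetical and chosen only to make the arithmetic plain; the function names are likewise introduced for illustration. Averaging percentages gives every subgroup equal weight, while combining numerators and denominators weights each subgroup by its number of cases.

```python
def average_of_percentages(groups):
    """Convert each (repeated, total) pair to a percentage, then average."""
    return sum(100.0 * repeated / total for repeated, total in groups) / len(groups)


def pooled_percentage(groups):
    """Combine numerators and denominators first, then convert once."""
    repeated = sum(r for r, _ in groups)
    total = sum(t for _, t in groups)
    return 100.0 * repeated / total


# Hypothetical subgroups of unequal size: (responses repeated, responses made).
groups = [(3, 10), (20, 40)]

averaged = average_of_percentages(groups)
pooled = pooled_percentage(groups)
print(averaged, pooled)
```

The two figures coincide only when the subgroups contribute equal numbers of cases or show equal percentages.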


TABLE III.—THE POTENCY OF “RIGHT” AND “WRONG”

                                                        Right          Wrong
In Trial 2 not before, Form IV - Form III               3.5            9.1
In Trial 3 not before, Form IV - Form III               1.0            3.8
In Trial 4 not before, Form IV - Form III               3.5            8.2
  Average                                               2.7            7.0
All responses, Form IV Trial 1 - Form III Trial 1       5.4            3.2
All responses, Form V Trial 1 - Form VI Trial 1         3.8            11.1
  Average                                               4.6            7.2
All responses, Form IV Trial 5 - Form III Trial 5       3.8            10.3
All responses, Form VI Trial 4 - Form V Trial 4         2.8            9.3
  Average                                               3.3            9.8
All responses, Form VI Trial 2 - Form VI Trial 1        5.8            9.9
All responses, Form V Trial 3 - Form V Trial 2          8.9            13.3
All responses, Form VI Trial 3 - Form V Trial 2         10.6           10.9
  Average                                               8.4            11.4
Average of the ten measures                             4.91 ± .87     8.91 ± .95

Another question of method concerns the computation of the base,
or control percentage. Should all the data be combined so that the
percentage of change is one hundred minus the per cent of repetition?
Or should the per cent of repetition be based for the R group only
upon the syllables which they were told were right, and for the W
group on those syllables not read to them? The latter course was
followed. In all cases, the control percentage of repetition was based
on rights and the control percentage of change was based on wrongs.
This course was thought necessary in order to eliminate the effect of the
reading from the measures of potency. In Table V, the reworked
Experiment II results are shown by both methods. The common base
could raise or lower the potencies in any combination by a reduction
of chance error. Because of the reason given for not using it, it
might be expected in general to increase the potencies by including
the effect of both the reading and the later announcements of “right”
and “wrong.” Actually it raised the potency of “right” and lowered
that of “wrong.”

The consideration of the possible error involved in the evaluation 
of potencies because they are based on a non-informed control group 
is the subject of the next section. 


VI. EXPERIMENT IV 


Since the R and W group procedures made so little difference in
Experiments I to III, the division into R and W groups was dropped
for Experiment IV. Groups I and II in this experiment served
alternately as control and experimental groups, and there were,
throughout, two right and two wrong responses.


TABLE IV.—THE POTENCY OF “RIGHT” AND “WRONG”
(Regrouping of Data in Table III)

                                      Right    Wrong
[Experiment II, Forms III and IV]     3.4      6.9
[Experiment III, Forms V and VI]      6.4      10.9
[label illegible]                     5.2      6.6
[label illegible]                     4.7      11.3

                     R group                   W group                   R and W groups
                     Right        Wrong        Right        Wrong        Right        Wrong
T.C. Tuesday         8.0          5.7          5.2          7.3          6.6          6.5
T.C. Thursday        2.2          11.9         3.9          9.9          3.0          10.9
[label illegible]    [illegible]  7.3          3.6          3.6          5.6          5.4
[label illegible]    6.0          8.3          4.2          6.9          5.1          7.6
[label illegible]    [illegible]  13.4         8.7          12.5         4.4          12.9
T.C. and grad.       4.5 ± 1.5    9.6 ± 1.1    5.3 ± 1.0    8.3 ± 2.1    4.9 ± .9     8.9 ± 1.0

Through lack of foresight, no allowance was made for the inexperience
of the subjects used in Experiment IV. If anything, the time
per line was decreased, for a slightly more ambitious program was
undertaken. At any rate, there was no evidence of learning with
Forms I and II (initial score 24.86 and final score 24.83) and very
little with Forms III and IV (initial score 24.89, final score 25.28).
The resulting data are no less satisfactory for the first purpose of
Experiment IV, and fortunately serve the second purpose indirectly.
Instead of showing measures of the potency of “right” and “wrong”
apart from interference with general repetition habits, the data show
the influence of interference with such habits apart from the potency
of “right” and “wrong” as specific contributors to learning.


TABLE V.—THE EFFECT OF DIFFERENT METHODS OF ASSEMBLING DATA
Data from Experiment II
(Forms III and IV)

Method                                                   Potency     Potency
                                                         of right    of wrong
1. Combining positions, averaging percentages
   for small groups.                                     + 3.4       + 6.9

2. Combining from small groups, averaging
   percentages for positions.
                                          Position 1     + 2.0       +13.5
                                          Position 2     + 4.5       + 8.1
                                          Position 3     +  .7       + 6.9
                                          Position 4     +12.8       + 5.9
                                          Averages       + 5.0       + 8.6

3. Same as No. 2 except that bases for “right”
   and “wrong” were from the same data. In No. 1
   and No. 2 they were independent.
                                          Position 1     + 3.0       +12.2
                                          Position 2     + 4.6       + 8.0
                                          Position 3     + 1.4       + 8.2
                                          Position 4     +12.8       + 5.7
                                          Averages       + 5.5       + 8.5

A list of pairs of syllables was read as in Experiments II and III,
but from this reading there was no learning. The initial score for
control and experimental groups combined was 24.8 ± .3, instead of
the expected 25.0.

Responses on Forms I and II made in trial two for the first time
were repeated in 39.2 ± 1.9%* of the cases in which the subjects were
not told when right or wrong. On Forms III and IV, in which the
arrangement was changed from trial to trial, the location of the
response was repeated 30.0 ± 1.6% of the time. The repetition of
syllables, regardless of position, was 31.7 ± 1.6%. The conclusion
drawn is that repetition such as reported for Form III in Experiment
II is not wholly a matter of position or patterning. If anything, it is
due more to a repetition of syllables than to position and pattern.
There is justification for this opinion in the correlation of -.01 ± .16
between repetition in Forms I and II (Experiment IV) and repetition
of position on Forms III and IV (Experiment IV) as compared with
the correlation of +.59 ± .10 between the former and repetition of
syllables regardless of position. The thirty-one and seven-tenths per
cent repetition of shuffled syllables is four and two-tenths times its
standard error above the chance twenty-five per cent which might
have been assumed. Had the learning not been negligible, the second
half of Experiment IV would have yielded, instead of the unreliable .4
and 1.8, or .6 and 2.6 in Table VI, measures of the potency of “right”
and “wrong” less affected by change of pattern. Instead, the data of
Experiments II and III will be corrected with the results in the first
half of Table VI to give estimated potencies unaffected by changes of
pattern, position, or other general repetition habits. However, the
Form III and IV results, though unreliable, are such as might be
predicted on the assumption that the shuffling minimized the appearance
of spurious potencies and that there was a little learning with
Forms III and IV.

* The comparable per cent reported in Table II is 36.8.


TABLE VI.—EFFECT OF “RIGHT” AND “WRONG” IN EXPERIMENT IV
(Practically Zero Learning)

                                                        Per cent      Per cent
                                                        repetition    change of
                                                        of rights     wrongs
Based on responses made in trial 2 but not in trial 1.
  Forms I and II (not shuffled) told                    37.9          64.5
  Forms I and II (not shuffled) not told                40.2          61.8
  Effect of “right” and “wrong”                         -2.3          +2.7
Based on all responses made in trial 2.
  Forms I and II (not shuffled) told                    44.2          59.1
  Forms I and II (not shuffled) not told                45.8          54.
  Effect of “right” and “wrong”                         -1.6          +4.6
Based on responses made in trial 2 but not in trial 1.
  Forms III and IV (shuffled) told                      29.0          67.0
  Forms III and IV (shuffled) not told                  28.6          65.2
  Effect of “right” and “wrong”                         + .4          +1.8
Based on all responses made in trial 2.
  Forms III and IV (shuffled) told                      34.9          65.7
  Forms III and IV (shuffled) not told                  34.3          63.1
  Effect of “right” and “wrong”                         + .6          +2.6

Since there was no evidence of learning with Forms I and II,
rights and wrongs were combined to give the most reliable measure
of the extent to which being told when right or wrong reduced
repetition. When not told, the subjects repeated four hundred forty-six
of the eleven hundred thirty-four underlinings in trial two but
not in trial one. This is 39.33 ± .77%. While being told, they
repeated four hundred twenty-six times out of eleven hundred
seventy-three. This is 36.32 ± .76%. The reduction is 3.01 ±
1.08%. Of the full two thousand responses made in trial two, nine
hundred eighteen were repeated under control conditions and eight
hundred fifty-three under experimental conditions. This reduction
is 3.25 ± 1.57%.
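The arithmetic of this paragraph can be checked with a short sketch. The counts and the two errors (.77 and .76) are those reported in the text. The rule used to combine the errors (the square root of the sum of squares) is an assumption; the article does not state how the error of the difference was obtained, although this rule reproduces the reported 1.08.

```python
import math


def percent(part, whole):
    """Express a count as a percentage of its total."""
    return 100.0 * part / whole


# First-time responses in trial two, Forms I and II (counts from the text).
not_told = percent(446, 1134)
told = percent(426, 1173)
reduction_first_time = not_told - told

# Assumed rule: error of a difference between two independent percentages.
reduction_error = math.sqrt(0.77 ** 2 + 0.76 ** 2)

# All two thousand responses made in trial two.
reduction_all = percent(918, 2000) - percent(853, 2000)

print(round(not_told, 2), round(told, 2))
print(round(reduction_first_time, 2), round(reduction_error, 2))
print(round(reduction_all, 2))
```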

If this reduction in repetition is due in its entirety to a greater
attention to individual syllables when the subjects were being told
when right or wrong, then it is a correction which should be applied to
increase the potency of “right” and decrease the potency of “wrong”
as they were reported in Table III. The results of such a correction
are: For “right” 7.92 ± 1.39 and for “wrong” 5.90 ± 1.44, or for
“right” 8.16 ± 1.79 and for “wrong” 5.66 ± 1.84, according to the
correction used. By neither comparison is the difference between the
potencies greater than its standard error. In both cases the value
for “wrong” is more than three times its standard error.
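The corrected potencies can be reproduced from the Table III averages (4.91 ± .87 for “right,” 8.91 ± .95 for “wrong”) and the two reductions reported above (3.01 ± 1.08 and 3.25 ± 1.57). The sketch below is illustrative only: the names are hypothetical, and the root-sum-of-squares rule for combining errors is an assumption which happens to reproduce the published figures.

```python
import math

RIGHT = (4.91, 0.87)  # Table III average potency of "right" and its error
WRONG = (8.91, 0.95)  # Table III average potency of "wrong" and its error
CORRECTIONS = {
    "first-time responses": (3.01, 1.08),
    "all responses": (3.25, 1.57),
}


def combined_error(err_a, err_b):
    """Assumed rule: errors of independent quantities add in quadrature."""
    return math.sqrt(err_a ** 2 + err_b ** 2)


def apply_correction(amount, err):
    """Add the correction to "right" and subtract it from "wrong"."""
    right = (RIGHT[0] + amount, combined_error(RIGHT[1], err))
    wrong = (WRONG[0] - amount, combined_error(WRONG[1], err))
    return right, wrong


results = {}
for label, (amount, err) in CORRECTIONS.items():
    right, wrong = apply_correction(amount, err)
    results[label] = {
        "right": (round(right[0], 2), round(right[1], 2)),
        "wrong": (round(wrong[0], 2), round(wrong[1], 2)),
        # How many times its own error the corrected "wrong" value is.
        "wrong_over_error": round(wrong[0] / wrong[1], 1),
    }
    print(label, results[label])
```

The last entry for each correction gives the ratio of the corrected value for “wrong” to its error, the quantity on which the closing sentence of the paragraph rests.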

The above statement presents the case for “wrong” at its weakest,
for there is some reason for thinking that the corrections were due in
part to a difference between the experimental and control groups.
Trial one constitutes a test, since differentiation did not begin until
trial two. For trial one, on Forms I and II, control group repetition is
43.30%, experimental group repetition is 41.45%; on Forms III and
IV, control group repetition is 34.75%, and experimental group
repetition is 34.10%. Altogether there is a difference here of 1.25 ± 1.08%
between groups. If the correction is reduced by this amount, the
potencies of “right” and “wrong” become 6.67 and 7.15 or 6.91
and 6.91.

At this point, no attempt will be made to answer the question,
“Is ‘right’ in general more potent than ‘wrong’?” The question can
only be answered from evaluations of “right” and “wrong,” with
many variations. Even in similar tasks “right” spoken with one
inflection, or in a certain manner, might do one thing, and with
another inflection have a different effect. “Wrong” spoken by one
person might be more effective than when spoken by another.

A difficulty in the problem area of this article, which should be
mentioned, is the unit of measurement. If, at the beginning, responses
are made twenty per cent of the time, have “right” and “wrong” had
an equal effect if “right” raises the per cent to thirty and “wrong”
reduces it to ten? Or, must “right” raise the per cent to one hundred
in order to equal the potency of “wrong” in lowering it to zero? By
the latter supposition, lowering to ten would equal raising it to sixty.
The writer has perhaps minimized the importance of this question by
setting up difficult learning situations, but the question remains
unanswered.
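The two suppositions may be stated as two scales of measurement, sketched below. The function names are hypothetical, and the second scale (change expressed as a fraction of the room available in the direction of change) is one reading of the “latter supposition” which reproduces the example of ten and sixty given in the text.

```python
def absolute_effect(base, after):
    """First supposition: effect is the change in percentage points."""
    return abs(after - base)


def proportional_effect(base, after):
    """Second supposition: effect is the change as a fraction of the room
    available, toward 100 for a rise and toward 0 for a fall.
    Assumes 0 < base < 100."""
    room = (100.0 - base) if after >= base else base
    return abs(after - base) / room


def equivalent_rise(base, lowered_to):
    """Level to which a rise must go to match a given fall on the second scale."""
    fraction = proportional_effect(base, lowered_to)
    return base + fraction * (100.0 - base)


# On the first scale, raising 20 to 30 equals lowering 20 to 10.
print(absolute_effect(20, 30), absolute_effect(20, 10))
# On the second scale they differ, and lowering to 10 matches raising to 60.
print(proportional_effect(20, 30), proportional_effect(20, 10))
print(equivalent_rise(20, 10))
```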


VII. SUMMARY AND CONCLUSIONS 


In Experiment I the procedure used was shown to be similar to
that used by Thorndike in that it yielded similar results.

In Experiments II and III described in this article the assumption
of chance repetition as a base would have introduced a serious error.
From an assumed base, “wrong” would have been appraised as having
done more harm than good. Measured from an experimentally
determined base (in these experiments the base was repetition when
not told anything), the contribution of “wrong” to learning was
certainly positive, and not less than that of “right.” Whether the
subjects were told to avoid one response among four, or to repeat one
among four, had very little effect, but the demonstration of more than
chance repetition of nonsense syllables makes questionable the
assumption of chance repetition of any materials.

Experiment IV was an effort to answer the question, “Are the
potencies measured in Experiments II and III partly spurious?” In
other words, “Does the fact that the subjects are being told when
right and wrong affect their repetition of responses apart from the
specific contributions of ‘right’ and ‘wrong’ to learning?” The
results of the experiment suggest that announcements reduced
repetition. The application of the most valid of the obtained measures
of this general effect as a correction to the results of Experiments II
and III increased the potency of “right” and decreased that of
“wrong” so that they were about equal. At best, “right” contributed
more than “wrong” by a difference which was about equal
to its standard error, and, at least, “wrong” contributed to learning
by an amount equal to 3.1 times its standard error.


ACQUISITION AND RETENTION OF FACTUAL
INFORMATION IN SEVENTH-GRADE GENERAL
SCIENCE DURING A SEMESTER OF
EIGHTEEN WEEKS


AUBREY H. WORD AND ROBERT A. DAVIS 
University of Colorado 


This study, which is a part of a more extensive investigation, is an 
effort to throw light upon the nature of schoolroom learning including 
the acquisition of factual information during the progress of a course 
in general science and its retention during the period of instruction 
as well as over more extended intervals. 

The investigation, which was conducted with three sections of 
seventh-grade general science pupils in the public schools of Boulder, 
Colorado, involved administering at regular intervals a series of 
specially designed objective tests. In collaboration with the regular 
classroom teacher a careful analysis was made of the textbook to be 
used, three measurable objectives formulated for the course, and a 
study outline for the semester constructed. With the important 
facts and principles checked in the textbook as a basis, three types of 
tests were designed to measure the extent of attainment of the objec- 
tives. The present study deals with the ability of pupils to recall 
specific factual information as evidenced by their responses to simple 
completion items. 

Six hundred thirty items were prepared and arranged under two 
forms, A and B, the odd numbered items in Form A and the even 
numbered in Form B. Each form was then divided into nine parts, 
so as to be nine tests of thirty-five items each. In accordance with this 
division of items the material in the textbook was divided into nine 
approximately equal units. Pupils were informed in advance as to 
the exact number of pages to be studied during each two-week period, 
but were not informed of the purpose of the study or motivated in any 
special way other than to be informed of their test scores. On the 
first day of the semester mimeographed sheets stating the objectives 
of the course and a study outline were distributed. Instruction pro- 
ceeded as usual under the supervision of the regular teacher and all 
tests were administered by him; thus the conditions of the study were 
typical of the normal classroom. 
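The construction of the two forms can be sketched as follows. The item counts are those given in the text; that the items were numbered consecutively in textbook order, so that consecutive blocks of each form correspond to the nine units, is an assumption made here for illustration, and the function name is hypothetical.

```python
# Six hundred thirty completion items, numbered 1 to 630 (numbering assumed).
items = list(range(1, 631))

# Odd-numbered items form Form A; even-numbered items form Form B.
form_a = [i for i in items if i % 2 == 1]
form_b = [i for i in items if i % 2 == 0]


def split_into_tests(form, n_tests=9):
    """Divide one form into consecutive tests of equal length."""
    size = len(form) // n_tests
    return [form[k * size:(k + 1) * size] for k in range(n_tests)]


tests_a = split_into_tests(form_a)
tests_b = split_into_tests(form_b)
print(len(form_a), len(tests_a), len(tests_a[0]))
```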

The testing program began at the end of the first two weeks of
instruction by administering Form A of Test I. This test contained
thirty-five items based upon the material studied during this period.
At the end of four weeks, or two weeks following the first testing,
Form A of Test II and Form B of Test I were administered as a single
test of seventy items so intermingled as to prevent pupils, at least
partially, from detecting the presence of material on which they had
been tested two weeks previously. Beginning with the second testing
two separate scorings were necessary, one in order to determine the
amount of new material acquired during the two-week interval
immediately preceding the test and the other to gauge the amount retained
from the two-week period already tested two weeks earlier. Thus,
throughout the semester of eighteen weeks, acquisition and retention
were measured simultaneously, with the exception of the first two
weeks. Inasmuch as no supplementary reading or laboratory work
was required and all material presented in class was taken from the
textbook, it was assumed that the items of the several tests had a
satisfactory degree of validity of content. The coefficients of
reliability for the nine tests ranged from .809 to .895, the average coefficient
being .859 ± .018.
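The testing schedule described above may be laid out as a short sketch. The text states the composition of the first two testings explicitly; extending the same pattern (Form A of the current unit with Form B of the preceding unit) to the remaining testings is an assumption drawn from the statement that acquisition and retention were measured simultaneously throughout. The function name is hypothetical, and the final examination is not included.

```python
def testing_schedule(n_units=9, weeks_per_unit=2, items_per_test=35):
    """List, for each biweekly testing, the forms administered together."""
    schedule = []
    for unit in range(1, n_units + 1):
        forms = [("A", unit)]  # acquisition: the unit just studied
        if unit > 1:
            forms.append(("B", unit - 1))  # retention: the preceding unit
        schedule.append({
            "week": unit * weeks_per_unit,
            "forms": forms,
            "items": items_per_test * len(forms),
        })
    return schedule


schedule = testing_schedule()
for entry in schedule:
    print(entry)
```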

The measurement of acquisition and retention simultaneously made
possible a determination of cumulative progress during the semester.
In order to determine the sustained effects of acquisition, the final
examination was composed of the three hundred fifteen items
contained in Form B of the nine tests. By dividing these items into
groups of thirty-five and scoring each separately, retention over
intervals ranging from two to sixteen weeks was calculated. Acquisition
scores for each pupil on each of nine tests were recorded as well as his
retention scores after intervals of sixteen, fourteen, twelve weeks, and
so on. Differences in performance were noted and the amount of loss
or gain for each two-week interval determined.

The results of the study fall into two main divisions: The results 
dealing with the study of acquisition and retention simultaneously, 
and the results which show retention over varying intervals. 


1. THE STUDY OF ACQUISITION AND RETENTION SIMULTANEOUSLY 


In Table I are presented both the initial acquisition scores and the
retention scores for two-week intervals throughout the semester.
The distributions were subjected to the chi-square test,¹,² using the
obtained acquisition scores as “expected” results and the obtained
retention scores as “observed” results, in order to determine whether
the differences might be due to sampling errors or chance. The end
frequencies were combined so as to eliminate the zeros, with the
following results:

Unit        Chi-square    P
[I]         23.614        less than .01
[II]        154.950       less than .01
[III]       67.346        less than .01
[IV]        57.252        less than .01
[V]         133.958       less than .01
[VI]        33.571        less than .01
[VII]       46.042        less than .01
[VIII]      20.318        .029

¹ Garrett, Henry E.: Statistics in Psychology and Education. New York:
Longmans, Green & Co. 2d edition, 1937, pp. 119-124; 377-387.

² Fisher, R. A.: Statistical Methods for Research Workers. Edinburgh and
London: Oliver and Boyd, 6th edition, 1936, chapters 3 and 4.

In a similar manner obtained retention scores following varying
intervals of time were used as “observed” results and acquisition
scores based on the same units of work used as “expected” scores with
the following results:

Unit        Chi-square    P
[I]         20.895        .020
[II]        72.493        less than .01
[III]       136.086       less than .01
[IV]        45.134        less than .01
[V]         99.873        less than .01
[VI]        40.807        less than .01
[VII]       35.253        less than .01
[VIII]      106.192       less than .01
[IX]        33.769        less than .01

These data indicate that there is small likelihood that the results
tested could have been obtained by chance. The results of the
chi-square test would probably have been more valid if the sample had
been larger. Since the measurements of acquisition and retention
were made at different times, it unfortunately happened that the
same numbers of pupils were not always subjected to both testings;
for example, on the same unit of work, ninety-six pupils were tested for
retention whereas ninety-four were tested for acquisition.
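The computation described above can be sketched in a few lines. The frequencies below are invented for illustration and are not those of Table I; the function names are hypothetical. The sketch treats the acquisition frequencies as “expected” and the retention frequencies as “observed,” and combines end classes until neither end contains a zero. It assumes that no interior class has an expected frequency of zero and makes no adjustment for unequal numbers of pupils.

```python
def pool_end_classes(observed, expected):
    """Merge end classes inward until neither end holds a zero frequency."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and (exp[0] == 0 or obs[0] == 0):
        obs[1] += obs[0]
        exp[1] += exp[0]
        del obs[0], exp[0]
    while len(exp) > 1 and (exp[-1] == 0 or obs[-1] == 0):
        obs[-2] += obs[-1]
        exp[-2] += exp[-1]
        del obs[-1], exp[-1]
    return obs, exp


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequency distributions over seven score classes.
retention = [1, 6, 12, 16, 8, 5, 2]    # "observed"
acquisition = [0, 4, 10, 20, 10, 4, 0]  # "expected"

pooled = pool_end_classes(retention, acquisition)
print(pooled, chi_square(*pooled))
```

The probability P would then be read from a chi-square table with degrees of freedom appropriate to the number of classes remaining after pooling.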

Table II is a summary of Table I and shows the means and standard
deviations for the two sets of data. Perhaps the first observation that
may be made of the data of Table I is the wide individual differences
revealed in the acquisition and retention scores. These differences are

TABLE I.—ACQUISITION AND RETENTION SCORES FOR TWO-WEEK INTERVALS
DURING THE SEMESTER OF EIGHTEEN WEEKS

Score  Unit I     Unit II    Unit III   Unit IV    Unit V     Unit VI    Unit VII   Unit VIII  Unit IX
       Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.  Acq. Ret.

33 
31 — 3 ss ie 1 2 1 
BD fa 4 3 3 | eS ree ee 0 ss 3 0 1 
27 5 1 0 1 2 3 2 2 1 2 2 0 2 3 1 
25 | 10 3 7 4 4 1 1 0 1 BE es 0 1 4 6 8 2 
23 7 7 5 4 1 5 3 4 3 7 1 1 3 3 4 2 6 
21 | 10 9] 10 3 6 8 7 4 1 9 1 4 5 2 8 7 3 
19 | 11 | 12 9 6 5 | 16 | 14] 10 7 9 5 2 4/15 7 6 3 
17 | 20; 19 | 10 9} 15 9/13] 11 6 | 12 2 4/11} 12 8 9 6 
15 8 | 10]12;13{ 12] 138j11 1] 15 5 | 17 2 6; 12]111] 13 9 9 
13 11 | 16] 16} 11 6} 14] 18} 12] 15 9 6 4/11 4/14; 15 7 
11 4/11 7 9/18/11] 11 8 | 13 7110) 16 | 14] 15 7 | 13 8 

9 3 3 8 8 7 5 | 11] 13 7 7 | 16 y 8 8 | 16} 11 | 14 

7 4 3 5 9 6 4 1 7118 5 | 19 | 15 8 9 4 9/13 

5 1 8 7 3 2 2 7 2; 13 9 2 5 2 3 | 13 

3 ea 4 3 7 4 8 =a as 14 5 2 _ 1 6 

Ps ros Fe 1 Pe Pen eee Te 1 7 4 te : 1 5 
N | 93 | 94 | 94 | 96 | 96 | 94 | 94 | 92 | 92 | 93 | 93 | 90 | 92 | 93 | 93 | 94 | 96 



























































TABLE II.—MEANS AND STANDARD DEVIATIONS FOR ACQUISITION AND RETENTION
FOR TWO-WEEK INTERVALS DURING THE SEMESTER

Unit    Means                        Changes     Standard deviations
        Acquisition    Retention     in means    Acquisition    Retention
I       18.43          16.78         -1.65       5.14           4.46
II      16.82          14.61         -2.21       5.71           7.08
III     14.52          16.35         +1.83       6.36           5.38
IV      15.59          14.24         -1.35       4.63           5.33
V       11.78          16.08         +4.30       5.58           5.84
VI      8.81           10.08         +1.27       4.93           5.92
VII     14.02          14.83         +0.81       6.64           6.01
VIII    15.82          14.48         -1.34       5.78           5.74
IX      11.50          ...           ...         6.46           ...

even more pronounced than would usually occur because three sections
of pupils roughly graded as to ability and previous achievement have
been treated compositely. The standard deviations in Table II make
possible some interesting comparisons of the degree of individual
differences in acquisition and retention and throw some light on the question
as to the influence of identical training upon individual differences.
Do individual differences increase or decrease as pupils progress
through the semester? And are there significant differences in this
respect between acquisition and retention scores? The facts are
interesting, but the fluctuations in standard deviations are too marked
to justify generalizations. The study of the influence of identical
training upon individual differences is also complicated by the fact
that there is no assurance that the various units of subject-matter of
the course are comparable in difficulty, even though the amount of
factual material covered in each unit is approximately identical.

Although it is impossible to determine the cause of the fluctuations 
in mean acquisition scores from one testing period to another, they 
are probably due to the relative ease or difficulty of the various units 
of subject-matter. The introductory chapters of the textbook are 
comparatively easy and interesting, whereas the units dealing with the 
nature of matter, the electron theory, chemical symbols, formulas, 
and equations, even when treated in an elementary manner, are diffi- 
cult for junior-high-school pupils. It appears that the units containing 
descriptive material are acquired more readily than those in which 
the facts are functional. Distracting influences, whether they be 
extracurricular activities or out-of-school interests, contribute to the 
fluctuations in efficiency. It is unlikely, however, that such factors 
operate for all pupils during the same intervals; and since approxi- 
mately the same amount of time was devoted to class discussion of each 
unit of material, the difficulty of various units may be assumed to be a 
determining factor in the initial degree of mastery. 

Of greater significance from the standpoint of this study is a
comparison of the mean acquisition and retention scores. It will be noted
that losses were found for Units I, II, IV, and VIII, whereas gains were
found for Units III, V, VI, and VII. It may be observed that higher
acquisition scores were followed by lower retention scores on
subsequent testing, but that the lower acquisition scores were followed by
varying degrees of gain. This may be partially explained by the fact
that the typical pupil is anxious to improve his score and consequently
spends considerable time in review. Conversely, a pupil who has made
a creditable score no doubt devotes little or no time to review of
material previously acquired. It also seems plausible that some of the
improvement upon subsequent testing is due to integration and
interdependency of subject-matter in progressing from one unit to the
next. The textbook employed in these general science classes shows
integration and correlation of subject-matter to an unusual degree.
Frequently, items which seem difficult or obscure at the time of
presentation become clarified after they have been related to subsequent
facts and principles. Some pupils ruminate upon previously acquired
material, whereas others may devote little attention to items after
their initial acquisition.

Fig. 1.—Acquisition and retention scores plotted cumulatively.


Still another way to show the results obtained from a study of 
acquisition and retention simultaneously is to add cumulatively the 
means for the two sets of data for each testing period and obtain the 
curves of Fig. 1. These curves show not only the number of facts 
acquired throughout the semester but the degree of reminiscence and 
forgetting at various testing periods. 
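The cumulative totals underlying such curves can be obtained by running sums of the unit means. In the sketch below the means are those of Table II, with the Unit IX acquisition mean (11.50) taken from Table IV; the variable names are hypothetical, and the exact manner in which the published curves were plotted is not stated in the text, so the sketch shows only the cumulative addition itself.

```python
from itertools import accumulate

# Mean acquisition scores for Units I to IX.
acquisition = [18.43, 16.82, 14.52, 15.59, 11.78, 8.81, 14.02, 15.82, 11.50]
# Mean two-week retention scores for Units I to VIII.
retention = [16.78, 14.61, 16.35, 14.24, 16.08, 10.08, 14.83, 14.48]

# Running totals, rounded to the two decimals carried in the tables.
cumulative_acquisition = [round(x, 2) for x in accumulate(acquisition)]
cumulative_retention = [round(x, 2) for x in accumulate(retention)]

for unit, total in enumerate(cumulative_acquisition, start=1):
    print(unit, total)
```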


2. THE STUDY OF RETENTION OVER VARYING INTERVALS 


By checking the scores on each successive group of thirty-five items
in the final examination it was possible to determine the amount
retained over intervals ranging from five days to sixteen weeks. The
distribution of retention scores for these varying intervals is presented
in Table III, with the means and standard deviations summarized in
Table IV. Table IV shows that Units II, IV, VI, and VIII suffered
losses whereas Units I, III, V, and VII made gains. No definite
relationship appears between the amount retained and the recency
of the material studied; however, for the first eight testing periods the
greatest loss was found in Unit VIII. This is probably due to the
fact that little time remained following the test in which to assimilate
the material of this unit. Since less than one week elapsed between
administration of Test IX and the final examination, this condition
was more pronounced, as is attested by a loss of 3.33 points during the
interval of five days. A slight improvement in score on material
included in Unit I after a lapse of sixteen weeks indicates that the
amount of loss is not necessarily a function of the length of time.
Nor can it be assumed that the higher initial scores will be
followed by greater losses, for the acquisition score on Unit I was the
highest of all nine tests; yet after the longest lapse of time a gain was
found. On the other hand, Unit VI showed the lowest mean initial
score and after six weeks a loss was noted. Since the degree of initial
mastery does not appear to affect the amount retained during varying
intervals, the nature of the material again seems to be the most
plausible factor conditioning these results. Figure 2 graphically
compares the means of Table IV.


TABLE III.—RETENTION SCORES OVER INTERVALS VARYING FROM SIXTEEN WEEKS
TO FIVE DAYS

Score   Unit I    Unit II   Unit III  Unit IV   Unit V    Unit VI   Unit VII  Unit VIII Unit IX
        16 weeks  14 weeks  12 weeks  10 weeks  8 weeks   6 weeks   4 weeks   2 weeks   5 days
33
31                1                             1                   1
29      1         2                   1         1         1         0
27      4         5                   0         3         0         1         1
25      5         6         3         2         6         2         1         1         2
23      10        4         11        2         5         2         5         7         0
21      18        5         7         3         6         1         8         5         3
19      14        7         12        14        9         4         10        13        3
17      11        7         8         11        12        2         9         8         5
15      10        6         10        11        9         5         17        7         4
13      10        6         6         13        3         6         15        12        5
11      4         16        7         8         8         7         6         10        6
9       6         11        12        11        5         11        8         6         16
7       1         9         3         6         5         5         4         5         9
5                 3         6         4         4         7         6         6         7
3                 1         5         4         2         16        2         5         10
1                 4         3         1         6         11        1         4         12
N       94        93        93        91        85        80        94        90        82


TABLE IV.—MEANS AND STANDARD DEVIATIONS FOR RETENTION SCORES FOR
INTERVALS VARYING FROM SIXTEEN WEEKS TO FIVE DAYS

Unit    Interval    Means                        Changes     Standard deviations
                    Acquisition    Retention     in means    Acquisition    Retention
I       16 weeks    18.43          18.67         +0.24       5.14           4.92
II      14 weeks    16.82          15.03         -1.79       5.71           7.34
III     12 weeks    14.52          14.70         +0.18       6.36           6.64
IV      10 weeks    15.59          14.09         -1.50       4.63           5.56
V       8 weeks     11.78          15.57         +3.79       5.58           7.44
VI      6 weeks     8.81           9.43          +0.62       4.93           6.78
VII     4 weeks     14.02          14.99         +0.97       6.64           5.70
VIII    2 weeks     15.82          13.94         -1.88       5.78           6.44
IX      5 days      11.50          9.40          -2.10       6.46           6.14


Fig. 2.—Reminiscence and forgetting during a semester of eighteen weeks.

In order to minimize the effect of additional learning or relearning, 
no formal review was held in preparation for the final examination. 
Nevertheless, some pupils probably did review certain parts of the 
material. All that may be said is simply that the inability to control 
the quantity or quality of study in a classroom is an inherent limitation 
of classroom experimentation. In the laboratory both the learning 
and testing may be controlled, but in the classroom it is usually 
possible to control only the testing. 

Pupils differ in their interests both in and out of school, and because 
of the comprehensive nature of a beginning course in general science 
almost every pupil finds sections dealing with material relating to his 
particular interest, hobby, or out-of-school environment. Such mate- 
rial becomes more significant and consequently more readily retained. 
Similarly, those units closely related to material presented in other 
courses are likely to be more easily acquired and retained since over- 
learning in varying degrees is operating. Also it should not be over- 
looked that the double testing program (acquisition and retention 
simultaneously after the first two weeks) is, doubtless, an important 
factor in the degree of learning as well as in the degree of retention for 
varying intervals. The testing program itself is an important phase 
of the training which the pupils received. 

The results of the study, it is believed, throw new light upon learn- 
ing during the progress of a course. There are a number of experi- 
mental investigations which use standardized or nonstandardized 
achievement tests to measure improvement, but these studies seldom 
extend over more than two widely separated testing periods. Then, 
too, the majority of these investigations treat learning as incidental to 
the objective of the experiment, the primary aim being to test the 
effectiveness of such experimental factors as methods of teaching or 
motivating devices. There appear to be few, if any, objective investi- 
gations which trace cumulative progress over extended intervals of 
time. In the case of retention studies involving school subjects, 
investigators have either (1) determined how much was known at the 
end of a course and how much was known at definite intervals there- 
after or (2) measured the gains in a subject or course during the period 
of instruction (using initial and end tests) and determined how much 
of the gain was retained after varying periods of time. Of the two 
procedures the latter is obviously the more accurate. These studies,
like those dealing with acquisition, are usually based upon not more 
than two testing periods and consequently furnish no basis for showing 
cumulative progress over a period of time or for determining the sus- 
tained effects of training for varying intervals. 

A few unique characteristics of the techniques employed may be 
mentioned. First, the study deals with typical pupils in typical 
public-school classes with a minimum of variation from the established 
procedures of the regular teacher. Second, learning is conceived as 
the concurrent or simultaneous functioning of acquisition and for- 
getting, that is, learning involves a continual interplay of the two 
processes. Third, since learning is treated not only as the functional 
intermingling of the two processes, but also as a net result of cumulative 
or progressive growth, the study extends over a sufficiently long time to 
carry on repeated and continued testing. 


EXAMINE THE EXAMINATION 


R. W. EDMISTON 


Miami University


Examinations, tests, quizzes, in fact all types of measures
employed in the school, are used to provide definite evaluations of
some ability. The objective test movement’s popularity was partially
due to a desire for improvement of measuring devices. The various
forms of the objective test demanded precise directions if the
achievements to be evaluated, rather than an understanding of how to
proceed with the examination, were to be measured. In the endeavor to
demonstrate the superiority of the objective test over the essay type,
experiments were conducted whose results established the greater
reliability of the objective test. An examination of these experiments
discloses directions accompanying the objective test and further 
instructions for scoring which were absent from the essay examina- 
tions. In the case of arithmetic tests, this absence of scoring direc- 
tions constituted the only difference between some of the old and 
new types of examinations. These observations lead one to assume 
that careful directions should accompany all examinations. What 
features should these directions emphasize? 


BOTH TEACHERS AND STUDENTS ARE CRITICAL 


In order to provide points for college students’ consideration when 
taking examinations, both students and teachers were asked to 
report any difficulties which lessened the reliability or validity of 
tests and examinations. The instructors designated as the main 
difficulties in obtaining valid and reliable test results: (1) illegibility, 
(2) poor vocabulary, (3) lack of ability to express, (4) incompetence or 
carelessness in following directions, (5) loose papers without names, 
(6) waste of time on a question that student does not possess informa- 
tion to answer, (7) so much written the necessary answer is obscured, 
(8) too much haste, (9) tendency to give the benefit of a doubt to 
students with good records and remove same from students with poor 
records, (10) emotional reactions of students when taking a test, and 
(11) crowded classrooms. 

The students repeated many of the points mentioned by their
instructors. They gave a somewhat different interpretation to what the
instructors listed as carelessness in following directions by entering
statements to the effect that (1) questions are vague and poorly stated
and (2) the directions are neither clear nor complete. The possibility
of guessing on new type questions received frequent mention. The 
students specify cheating which the instructors ignored unless implied 
in (11) above. Other features offered by the students are: (1) poor 
health at time of test, (2) nervousness whenever taking an examina- 
tion, (3) lack of confidence in self, (4) student scorers who do not 
understand materials fully and favor friends, fraternity brothers, and 
sorority sisters, (5) inability to say what one means, (6) poor division 
of time per questions, (7) necessary haste to complete long examina- 
tions, (8) carelessness, and (9) bluffing. 

Both students and professors have offered criticisms which desig- 
nate needed remedies. This paper presents some studies of the 
difficulties enumerated in order to suggest remedies. A page of 
directions for the general test or examinations is the outcome. 


COLLEGE ILLITERACY 


The instructors’ reported difficulties—(1) illegibility, (2) poor 
vocabulary, (3) lack of ability to express, and (4) incompetence or 
carelessness in following directions—suggest that college students are 
illiterate. Perhaps the statement, ‘‘incompetence or carelessness,” 
indicates the true condition. 

One hundred freshman examination papers were inspected for 
indications of the difficulties expressed in the above four criticisms. 
Three students’ papers provided the total of five definite occurrences 
of illegibility. These passages could not be deciphered by three 
people even when an ordinary reading glass was used. Two of the 
five instances were the result of students using a very soft pencil. 
These passages might have been read if a harder pencil, or ink, had 
been used. Another student furnished two illegible passages. Her 
writing was so small and crowded together that many statements 
were difficult to read. The third student was a very poor writer. 
Her entire paper was difficult to read. These papers would irritate 
many scorers. Two papers had T’s similar to F’s on true-false parts.
The correct writing instrument, provision and use of sufficient space 
for answers, and ability to write are indicated needs. Adjustments 
for the first two might be made by appropriate directions. A demand 
for legibility might aid. This same group was given a later examina- 
tion with directions 2, 4, and 10 as they appear in the form at the 
end of this article. 





128 The Journal of Educational Psychology 


2. Write legibly. Your answer can’t be right if it can’t be read. If a 
T can not be distinguished from an F, the answer is wrong. Be sure 
your pen or pencil (if allowed) fosters distinct and not blurred writing. 

4. Space (the backs of sheets, the margins, or an extra sheet) should be
used for: 

a. computations. 

b. practice in the formation of desirable statements, not padded 
but furnishing quality rather than quantity to the answer. 

c. the hasty jotting of facts pertaining to some questions when 
these facts arise while working upon another question. 

10. Reread each answer before passing to the next question and the com- 
pleted examinations before delivery to the instructor. Is the meaning 
clear and the writing legible? 


The first pages of the papers looked better than the previous set 
upon direct comparison. Some second pages showed improvement. 
Later pages showed little or no advancement, although certain words 
had been rewritten as if in response to direction 10. Of the three 
students with the five illegible passages in the former test, one, the 
poor writer, had the only illegible passage. The T’s and F’s were more
readily distinguishable. The illegibilities on the first test had not 
been called to the attention of the students. 

The same one hundred freshman papers provided forty-two indica- 
tions of vocabulary difficulties. One student used capital and corporal 
as synonymous when discussing punishment, fourteen used verbal 
and oral synonymously, seven added physical age as a different age 
when anatomical age was already given, and twenty did not understand 
the term correlate. Directions 3, 4, and 5 on the form are: 


3. Use terms or a vocabulary suited to the subject. Do not use a word
unless its meaning is clear to you, and repeat a word rather than use 
another which may not have exactly the same desired meaning. 

4. Space (the backs of sheets, the margins, or an extra sheet) should be
used for: 

a. computation. 

b. practice in the formation of desirable statements, not padded but 
furnishing quality rather than quantity to the answer. 

c. the hasty jotting of facts pertaining to some questions when these 
facts arise while working upon another question. 

5. The statement of each question must be fully considered. Carelessness 
not only penalizes the student but also lowers the dependability of the 
measurement obtained by the instructor. 

After these directions had been given, one hundred freshman papers
showed none, fifteen, five, and eighteen errors on the points reported 
above. The results signify that more than these instructions are 
needed to remove vocabulary difficulties. 
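
The before-and-after counts reported above can be tallied in a few lines. The following is only an illustrative sketch in Python; the category labels are shorthand introduced here, while the counts themselves are those given in the text.

```python
# Vocabulary errors per one hundred freshman papers, as reported in the text.
# The dictionary keys are shorthand labels introduced for this sketch.
before = {
    "capital/corporal": 1,
    "verbal/oral": 14,
    "physical/anatomical age": 7,
    "correlate": 20,
}
after = {
    "capital/corporal": 0,
    "verbal/oral": 15,
    "physical/anatomical age": 5,
    "correlate": 18,
}


def change_by_category(first, second):
    """Return the change in error count (second minus first) per category."""
    return {name: second[name] - first[name] for name in first}


changes = change_by_category(before, after)
total_before = sum(before.values())
total_after = sum(after.values())

for name, delta in changes.items():
    print(f"{name:25s} {before[name]:3d} -> {after[name]:3d} ({delta:+d})")
print(f"{'total':25s} {total_before:3d} -> {total_after:3d}")
```

The totals fall only from forty-two to thirty-eight, which agrees with the statement that more than these instructions are needed.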

The lack of ability to express is tied up with the lack of vocabulary. 
The detection of failure to answer due to inability to express was not 
investigated in any other manner because of the uncertainty of data 
resulting from reports like, ‘‘I knew the answer but could not state it.” 
Failure to follow directions is considered in the following experiment. 


CREATURES OF HABIT 


The examination of freshmen’s test papers showed errors due to 
various failures to follow directions. Whether these errors were due 
to incompetence and carelessness on the part of students or failure to 
provide clear and complete directions on the part of the instructors 
offered opportunity for argument. However, when only T’s were
marked on true-false questions with directions to mark T’s and F’s, the
student must have been at fault. The same was true when + and −
were substituted for the requested T and F. The instructor who
wrote multiple choice questions with more than one correct completion 
to be checked took some chance in departing from the more common 
procedure of offering one correct completion. However, since the 
directions said that one or more completions could be correct and that 
all correct completions were to be marked, the student who marked 
only one of two recognized correct completions still appeared at fault. 
If the statements following these directions had contained an example 
with no correct completions, the maker of the test might be held 
responsible for misunderstandings. Directions 5, 6, and 7 were placed 
with the examination directions to take care of this point: 


5. The statement of each question must be fully considered. Carelessness 
not only penalizes the student but also lowers the dependability of the 
measurement obtained by the instructor. 

6. The directions telling how to answer the questions should be carefully 
followed. Underscore the important points in the directions. 

7. In essay questions, underscore the part of the statement that furnishes 
the direct question asked. Then underscore any parts of the statement 
which furnish data for the answer. Number each part so that you will 
not omit anything from your answer. 


The effectiveness of underscoring the important part of a direction is 
indicated in the following report. 

By giving essay tests and repeating them two weeks later with
neither further instruction nor return of the first papers, the value of
underscoring as an aid in the exact determination of what a question
asks may be demonstrated to students. Consider the following
question: “What step in learning the motor habit, writing, would be
furthered by tracing?” An improvement of five per cent is produced
in the results of fifty freshmen¹ by applying directions 5, 6, and 7 before
answering this question. An analysis of improvements designated
the following: When step was underscored, the student noted that it
was singular and not plural. When motor habit was underscored, it
suggested the answer, motor coördination. When tracing was
underscored, no one mistook it for training. The answers to the three
questions below showed an improvement of eight per cent with
underscoring.²


1. Give an example of a problem attack or piece of learning by the induc- 
tive method. Give the steps in procedure and number each; follow 
each step with the appropriate procedure for your example. Answer 
on the following lines. 

2. What negative transfer will be introduced by learning a poem by 
stanzas rather than by wholes? 

3. Complete the following by introducing (1) memory, (2) imagination, 
or (3) reasoning on the appropriate blanks. 

One’s ambitions demand ________
In making hypotheses, reasoning is added by ________
In providing data, reasoning has the aid of ________
In creative thinking, reasoning is aided by ________











Analysis showed that all of the parts asked in (1) became more 
apparent when underscored. The omissions made before underscoring 
would recommend fewer points in the same question or some procedure 
such as numbering to bring out each point. The statement to number 
each step in a question was added to direction 7 after these experi- 
ments. Some students did not see the negative in question (2) before 
underlining. Several reversed the consideration to “by wholes
rather than stanzas” before directed to underscore. Number (3) looks
foolproof but the “or” is not so noticeable until after the underscoring,
which lessened the number of double answers. More (1), (2), and
(3) answers, that is, the numbers preceding the correct word rather than
the word itself, were given when administered without directions, and these
are not correct according to the key. However, these numbers were
given credit on both first and second scorings and do not enter into
the eight per cent improvement.

¹ No improvement was shown on repetition by the equated fifty of the original
one hundred answering the question.
² Experimental group improved mean score by eight per cent, control by .9 per
cent.

The effect of underscoring the important words in the questions 
before presenting to students was also tested. The same questions 
were used and little difference, less than one per cent, was found
upon administering without and with underscoring. This group
gained another eight per cent when given the test to underscore for 
themselves. The larger gain than previously reported may be due to 
the extra administration of the test. This group had taken the test 
three times with a two-week interval between administrations, no 
papers returned, no class discussion, and no further instruction on 
these points. 

Directions 5, 6, and 7 are recommended to break students’ habits of 
reading carelessly and answering what they see at a glance rather 
than the question resulting from careful reading. 


WHAT’S IN A NAME 


“Write your name conspicuously on each sheet” is a caution not
recommended for examination directions. This statement aids any 
predisposition to favor one student more than another. Both students 
and college instructors mentioned this tendency in personal interviews. 
Further information was obtained to provide more definite data. 

Fifty experienced teachers were asked to answer “yes” or “no”
to the following questions: (1) Should the better pupil be given the 
benefit of a doubt, which benefit is not given to other pupils, when 
scoring papers? (2) Do you follow the above rule in marking papers? 
Answers showed twenty-six yes’s and twenty-four no’s on question 
(1) with twenty-five of each on question (2). The papers were not 
signed and the teachers were asked to consider their answers carefully 
since scientific data were desired. Three teachers answered “yes” on
question (1) and “no” on question (2), and two others answered
“no” on question (1) and “yes” on question (2). Others gave the
same answer to both questions. Repetition of these questions to the 
same group on consecutive days brought varying results on question 
(2) with little change of answers to question (1). Question (2) was 
changed to read: Do you ever remember referring to the pupil’s name 
before deciding whether or not an answer was to be accepted? The 
question brought so many affirmative answers that the existence of
this practice seemed assured. To substantiate further this evidence, 
the original questions (1) and (2) were presented for discussion to a 
group of secondary-school teachers and a group of college students. 
The “Do you” in (2) was changed to “Should you” for the students.
In both cases the discussion began with the statement that the better
pupil was more likely to have meant to give the correct answer and
should receive the benefit of the doubt, whereas a poor pupil should 
not be so favored. After discussion, both groups decided against this 
practice and concluded that any statement should be evaluated with- 
out reference to the maker. 

Two instructors were asked to score sets of one hundred papers 
without and with the owners’ signatures conspicuously recorded. The
papers were scored first without names and twice thereafter with 
names. One week elapsed between successive scorings. The follow- 
ing data indicate the effect of the knowledge of authorship. 


TABLE I. MEANS, THEIR DIFFERENCES, AND THE SIGNIFICANCE OF THESE
DIFFERENCES FROM SCORINGS OF PAPERS BY THE SAME SCORERS WITH AND
WITHOUT INDICATION OF AUTHORSHIP

N = 100         Without names    With names     Diff. /       Chances of
                Mean    PE       Mean    PE     PE (diff.)    true difference

Scorer No. 1    72      1.5      78      2.1    2.3            94 of 100
Scorer No. 2    41      0.7      49      ?.2    5.7           100 of 100




















Reading from left to right, scorer No. 1 marked the one hundred 
papers when no name was recorded and the resulting mean was 72 
with a probable error of 1.5; when names were recorded on these 
papers the same scorer gave marks providing a mean of 78 with 
probable error of 2.1; the difference between these means divided by 
the probable error of the difference gave 2.3, a figure which designates 
that the difference has ninety-four chances in one hundred of being a 
true difference. The last line should be read similarly but relates to 
the marks given by scorer No. 2. 
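
The arithmetic behind the reading of Table I can be sketched as follows. The article does not state its formulas, so two assumptions are made here: that the probable error of the difference is the square root of the sum of the squared probable errors, and that the chances of a true difference are read from the normal curve with one probable error taken as 0.6745 standard deviations. The function names are hypothetical. Under these assumptions the sketch reproduces the 2.3 and the ninety-four chances in one hundred reported for scorer No. 1.

```python
import math

PE_PER_SIGMA = 0.6745  # one probable error, expressed in standard deviations


def pe_of_difference(pe_a, pe_b):
    """Probable error of a difference between two means (assumed independent)."""
    return math.hypot(pe_a, pe_b)


def critical_ratio(mean_a, pe_a, mean_b, pe_b):
    """Difference between means divided by the probable error of the difference."""
    return abs(mean_b - mean_a) / pe_of_difference(pe_a, pe_b)


def chances_in_100(ratio_in_pe):
    """Chances in one hundred of a true difference, read from the normal curve."""
    z = ratio_in_pe * PE_PER_SIGMA
    return round(100 * 0.5 * (1 + math.erf(z / math.sqrt(2))))


# Scorer No. 1: mean 72 (PE 1.5) without names, mean 78 (PE 2.1) with names.
ratio = round(critical_ratio(72, 1.5, 78, 2.1), 1)
print(ratio, chances_in_100(ratio))
```

The same conversion applied to the ratio of 5.7 given for scorer No. 2 rounds to one hundred chances in one hundred.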

Table II should be read similarly to Table I. r₁₂ is the correlation
between (1) scores given when names were not recorded and (2) scores
when names were recorded. r₂₃ is the correlation between two sets of
scores obtained at different times from the same group of papers with
names appearing in both cases. 
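
The probable errors and ratios of Table II can be approximated in the same manner. The sketch below assumes the classical formula for the probable error of a product-moment correlation, 0.6745(1 − r²)/√N, with N = 100 papers; the article does not name the formula, and the function names are hypothetical. Under this assumption the computed values agree with the printed ones to within rounding.

```python
import math


def pe_of_r(r, n):
    """Classical probable error of a product-moment correlation (assumed formula)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)


def ratio_of_difference(r_a, pe_a, r_b, pe_b):
    """Difference between two correlations over the probable error of that difference."""
    return abs(r_b - r_a) / math.hypot(pe_a, pe_b)


# Scorer No. 1 in Table II: without-names against with-names scores (.71),
# and two with-names scorings against each other (.80); N = 100 papers.
r_first, r_second = 0.71, 0.80
pe_first = pe_of_r(r_first, 100)
pe_second = pe_of_r(r_second, 100)

print(round(pe_first, 3), round(pe_second, 3))
print(round(ratio_of_difference(r_first, pe_first, r_second, pe_second), 1))
```

For scorer No. 2 the printed probable errors of .029 and .019 give a ratio of about 2.9 by the same division.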

While the students in the two groups considered were not identical,
the same student received the greatest increase when without-a-name
was compared with with-a-name score. Both instructors reported,
upon being asked what they thought of the student, that she was
unintelligent and of poor personality. The students receiving greatest
increase when with-a-name score was compared with without-a-name
score in each group were considered intelligent and well prepared in
class. The actual records of intelligence proved the instructors’
estimates to have been correct. However, intelligence will have
enough effect upon the recorded content of examinations without
adding and subtracting further increments.


TABLE II. SIGNIFICANCE OF THE DIFFERENCES BETWEEN THE CORRELATIONS OF
SCORES GIVEN WITH KNOWLEDGE OF AUTHORSHIP (1) WITH A SIMILAR SET
AND (2) WITH A THIRD SET PROVIDED WITHOUT THIS KNOWLEDGE BY THE
SAME SCORERS

                Product moment correlations with PE’s    Diff. /       Chances of
                                                         PE (diff.)    true difference

Scorer No. 1    r₁₂ = .71 ± .033    r₂₃ = .80 ± .024     2.2           93 of 100
Scorer No. 2    r₁₂ = .75 ± .029    r₂₃ = .85 ± .019     2.9           97 of 100

The above data showed the effects of a knowledge of authorship 
upon marks. The means in Table I illustrated that scores in general 
were higher when names were recorded on papers than when no name 
appeared. The study that follows offers more data on this situation. 

The appropriate direction would be similar to the following: “Your
name should appear on the back or last sheet of the examination if 
sheets are securely bound. Each loose sheet should have the name 
entered inconspicuously, preferably where it need not be seen by the 
scorer when scoring.” This statement should make favoritism more 
difficult. It is the first on the direction sheet. 

For the student who desires to take advantage of favor and escape 
the results of disfavor, the following rule is offered: “If you have
convinced the instructor of your superior ability by former success, 
place your name conspicuously on each sheet of the examination; if 
unfavorable impression exists, emphasize the inconspicuousness of 
your signature.” 


IT IS MORE BLESSED TO GIVE


To make further study of the truth of students’ statements that 
personal likes and dislikes affect scorers, modified essay tests were given 
to students to score. The one hundred ten students in this experiment 
had been asked during a period of personality rating to classify each 
other as (1) friend, (2) relative, (3) acquaintance, (4) unknown to me, 
or (5) known and not liked. 

These students were in four class groups and had no knowledge of 
any relationship between these ratings and their marking of papers. 
Directions for scoring papers gave statements to be accepted and 
warnings to credit no other unless certain the answer carried the same 
meaning as that of the key. Three of the clerical force scored the 
same thirty papers. The three sets of scores obtained from the clerical
force gave means of 11.11, 11.21, and 11.33. The reliability of their
scoring when following the same directions and key given the students
is designated by the correlations between the sets of scores obtained:
r₁₂ = .95; r₁₃ = .97; r₂₃ = .91. No marks were placed on papers.
The scores were recorded on sheets provided for this purpose. The
name of the student who wrote the paper and number of scorer
appeared upon each. Eight students scored each paper. Analysis of
results does more than furnish data to substantiate the statement that
friends receive undeservedly high marks from student scorers. Results
show that some students tend to give undue credit to friends and
acquaintances and that others are supercritical of all. Some are more
critical of students marked friends than of mere acquaintances. No
one showed more favor to “not a friend” than to others. However,
some students marked papers according to directions although the
names of the writers of the papers appeared conspicuously on each
sheet. Table III presents the data under six divisions or descriptions
of scorers’ relation to scored. The “relative” classification is omitted
due to the small number of cases. 
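
One way to read the means of Table III is to set each class of student scorer against the instructor’s mean for the same papers. The sketch below does only that subtraction; taking the instructor’s mean as the baseline is an assumption made for illustration, and the function name is hypothetical.

```python
# Mean scores from Table III (same papers, different classes of scorer).
means = {
    "Friend": 15.26,
    "Acquaintance": 16.18,
    "Unknown to me": 14.22,
    "Not a friend": 13.19,
    "Self": 14.75,
}
INSTRUCTOR_MEAN = 10.99  # taken here as the baseline; an assumption of this sketch


def excess_over_instructor(student_means, instructor_mean):
    """Points by which each class of student scorer exceeds the instructor's mean."""
    return {k: round(v - instructor_mean, 2) for k, v in student_means.items()}


excess = excess_over_instructor(means, INSTRUCTOR_MEAN)
for label, points in sorted(excess.items(), key=lambda item: -item[1]):
    print(f"{label:15s} +{points:.2f}")
```

Every class of student scorer stands above the instructor, with “not a friend” the least so.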

Further scoring of objective types showed less but similar variations 
which, if not due to carelessness, must be considered intentional. A 
sampling of four hundred standardized tests in arithmetic was checked 
for errors in scoring. There were one hundred twenty-two errors; 
eighty-one in the pupil’s favor. The tendency to give is again 
emphasized. 
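
Whether eighty-one of one hundred twenty-two errors in the pupil’s favor could arise from carelessness alone may be examined with a simple binomial calculation. This check is not part of the original study; it assumes that a careless error is equally likely to fall for or against the pupil and that errors are independent. The function name is hypothetical.

```python
from math import comb


def upper_tail(successes, trials, p=0.5):
    """Exact binomial probability of at least `successes` in `trials`."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


errors, in_pupils_favor = 122, 81
share = in_pupils_favor / errors
p_value = upper_tail(in_pupils_favor, errors)

print(f"share of errors in the pupil's favor: {share:.2f}")
print(f"chance of 81 or more under even odds: {p_value:.5f}")
```

Under these assumptions the chance is well below one in a thousand, which is in keeping with the tendency described.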

The results of these experiments favor some indication of authorship
of each examination paper which will permit recognition of writer and
still be inconspicuous. Some instructors have numbers entered on
papers and do not disclose identities to scorers. Others ask that 
names be placed in positions not accessible to scorer without the effort 
of turning a page. If instructors mark their own papers, the identifi- 
cation mark must be placed in an inconspicuous position. Some 
teachers recognize handwriting and, therefore, know whose paper is 
before them. In any case a conscious effort should be made to score
all papers on content only and a sheet of directions for scorer seems as
necessary as one for the students taking the examination.


TABLE III. MEANS AND STANDARD DEVIATIONS OF SCORES GIVEN BY DIFFERENT
CLASSIFICATIONS OF SCORERS ON SAME SET OF PAPERS

N = 44                 Friend   Acquaintance   Unknown to me   Not a friend   Instructor   Self

Average scores         15.26    16.18          14.22           13.19          10.99        14.75
Standard deviations     3.43     3.53           4.22            4.48           2.93         3.55


HASTE IS WASTE? 


Can directions aid the less confident who in seventeen of twenty- 
two reports were the same students who stated that examinations 
were too long? Examinations which appear easy of completion for a
majority continue to receive this criticism from a minority. Is this 
minority poorly informed upon the achievement measured, slow by 
general habit, or lacking in method of attack? If the last named 
difficulty causes lower marks, directions might lead to improvement. 


TABLE IV. MEANS IN PER CENTS WITH THEIR STANDARD DEVIATIONS FOR
GROUPS’ TEST SCORES, WITH AND WITHOUT DIRECTIONS

Test      With directions               Without directions
          Group          M     σ        Group          M     σ

No. 1     Group No. 1    72    6.33     Group No. 2    62    6.88
No. 2     Group No. 2    65    6.74     Group No. 1    70    6.48























To study this hypothesis, two groups of fifty each were equated in
accordance with former tests in psychology. The groups were rotated,
with and without direction 8, when two tests in psychology were
administered with an intervening period of two weeks. These tests
were numbered 1 and 2 in order of presentation. The data obtained
are recorded in Table IV.

The differences are in favor of the directions in both instances, 
group No. 1 showing an improvement of two per cent and group No. 2 
of three per cent. Since group No. 1 had the directions on the first 
test, some carry-over may explain their lesser improvement. However, 
neither difference is statistically significant.¹ Considering the
individuals of group 2, who had the tests first without and later with
directions, five did not write on the last question on test No. 1. Three of
these five wrote on the last question in test No. 2. Their scores in 
per cents changed from five to nineteen, seventeen to twenty-three, 
and thirty-one to forty, respectively. The other two failed to com- 
plete test No. 2. These improvements may be statistically significant, 
but the achievement was so low that it would take more than this 
statistical improvement to bring satisfactory achievement. The last 
question was the least difficult in both tests. All students in group 1 
wrote on the last question in test No. 1 which was given with directions. 
One of this group failed to write on the last question on test No. 2. 
Her change in score was from thirty-six to twenty-eight. These 
directions are of some value to pupils who fail to answer questions in 
the latter part of an examination. A greater number of this selected 
type would be necessary to furnish definite conclusions. 
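
The footnote to this passage reports ratios of .32 and .31, each a difference divided by the standard deviation of the difference, and reads them as sixty-three and sixty-two chances in one hundred. The sketch below shows how such a reading may be obtained from the normal curve. The conversion is an assumption, since the article does not describe its table, and the ratios are taken as printed rather than recomputed from Table IV.

```python
import math


def chances_in_100(ratio_in_sd):
    """Chances in one hundred of a true difference for a ratio in SD units."""
    return round(100 * 0.5 * (1 + math.erf(ratio_in_sd / math.sqrt(2))))


# Table IV means in per cent: (with directions, without directions).
table_iv = {"Group No. 1": (72, 70), "Group No. 2": (65, 62)}
# Ratios of each difference to its standard deviation, as printed in the footnote.
printed_ratio = {"Group No. 1": 0.32, "Group No. 2": 0.31}

for group, (with_dir, without_dir) in table_iv.items():
    gain = with_dir - without_dir
    print(group, gain, chances_in_100(printed_ratio[group]))
```

A ratio of zero would give fifty chances in one hundred, so values of sixty-two and sixty-three stand only a little above an even chance.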


CONCLUSIONS 


The directions provided from this study offer definite improve- 
ments in students’ records of achievement obtained from examinations. 
They increase the validity of the examination by obtaining a truer 
record of the knowledge of the examined and a more exact scoring of 
this record. Since the directions were definitely called to the attention 
of the groups being tested, instruction in taking examinations rather 
than the mere provision of directions was offered. Perhaps a definite 
provision for this instruction should be made in the program of studies. 
Since underscored important parts of the examination did not bring
the results obtained by the requirement that students underscore, the
need of students’ attention through practice of the recommended
procedures of the directions is emphasized.

¹ By dividing the differences by the computed standard deviations of the
difference, .32 and .31 showing sixty-three chances in one hundred and sixty-two
chances in one hundred of a true difference were obtained for groups 1 and 2,
respectively.

Some of the studies described need repetition with groups of 
students who have designated or shown the particular difficulty to be 
remedied. This is especially true of the last study in which most of
the students were not of the group who could not complete the ordinary 
examination. Further study may also provide directions to remedy 
features not affected by the directions as reported in this study. 
Some interested person whose students are not acquainted with these 
directions might determine the effects upon examination results of the 
whole instructional sheet. 


Name (Section 1 below) Date 
Course 








IMPORTANT CONSIDERATIONS IN TAKING EXAMINATIONS: 


1. Your name should appear on the first or last sheet of the examination if 
sheets are securely bound. Each loose sheet should have the name entered 
inconspicuously, preferably on back where it will not be seen by the scorer, 
when scoring. 

2. Write legibly. Your answer can’t be right if it can’t be read. If a T 
can not be distinguished from an F, the answer is wrong. Be sure your 
pen or pencil (if allowed) fosters distinct and not blurred writing. 

3. Use terms or a vocabulary suited to the subject. Do not use a word unless 
its meaning is clear to you, and repeat a word rather than use another 
which may not have exactly the same desired meaning. 

4. Space (the backs of sheets, the margins, or an extra sheet) should be 
used for 

a. computations 

b. practice in the formation of desirable statements, not padded but 
furnishing quality rather than quantity to the answer 

c. the hasty jotting of facts pertaining to some questions when these 
facts arise while working upon another question 

5. The statement of each question must be fully considered. Carelessness not 
only penalizes the student but also lowers the dependability of the meas- 
urement obtained by the instructor. 

6. The directions telling how to answer the questions should be carefully 
followed. Underscore the important points in the directions. 

7. In essay questions, underscore the part of the statement that furnishes
the direct question asked. Then underscore any parts of the statement 
which furnish data for the answer. Number each part so that you will 
not omit anything from your answer. 

8. Proceed directly through the examination with no lengthy consideration
of unfamiliar points. After completing the parts which were readily
answered, start again and answer those questions which yield to more
diligent effort. Do not waste time by trial and error method upon
questions which bring no recognition or recall of related materials. After
completing the second consideration of the test, spend the remainder of
the time upon the more familiar of the unanswered questions. Note that
hesitation wastes time, ruins confidence, and destroys mind set.

9. If after thorough consideration you do not understand some direction or
question due to other than lack of knowledge of the course, call the
attention of the person in charge with as little disturbance as possible in
order that the tester may come to your seat or allow you to come to
him as conditions may determine.

10. Reread each answer before passing to the next question and the completed
examination before delivery to the instructor. Is the meaning clear and
writing legible?











MENTAL FACTORS OF NO IMPORTANCE¹ 


TRUMAN L. KELLEY 


Harvard University 


In the excellent treatise by E. Spranger upon Types of Men, 1928, 
we have an illustration of a speculative philosophy of typology. 
Spranger gives detailed verbal explanations of the mental character- 
istics defining the seven fundamental types which he postulates. Such 
an approach as this to the problem of mental factors which is pre- 
dominantly speculative is classic, antedating the Christian Era. 

Many of the historic classifications of personality, ability, or 
temperament, have attained no popular recognition, some have had 
short periods of acceptance, and a few have had a vogue of generations 
or centuries. The classification arising in the Middle Ages of tempera- 
ments into Sanguine, Phlegmatic, Melancholic, and Choleric affected, 
or served, thinking for some centuries and to a slight degree still does. 
The caste system of India, differing from feudal stratifications in that 
it is in theory based upon differences in mental ability and outlook 
and not upon wealth and power, assigns individuals to priestly, 
warrior, merchant, and servant classes. This system is now under 
bitter attack in its native land for reasons probably related to some 
original weakness in it, but, on the other hand, the reasons may merely 
be related to cancerous accumulations of the centuries. 

We, in America, in 1938, do not live under systems of thought 
dominated by humours, or Hindu castes, but from their viability we 
may be sure that they have proven serviceable in important connec- 
tions. Even in the literature of today descriptions of the phlegmatic 
and the choleric types are common, and the characters of “The Servant 
in the House” and of “Babbitt” faithfully represent Hindu castes. 
There surely was and is a truth in these historic demarcations. 

Many of the important distinctions between “free” and “slave” 
antedating the Civil War did not disappear with the Emancipation 
Proclamation and the close of that war. However overlaid and con- 
fused the issue because of a certain correlation between skin color and 
mentality, nevertheless, the persistence for over two generations of 
certain aspects of the early distinctions proclaims a reality or element 
of truth in them that can scarcely be gainsaid. 





¹ An address delivered before the American Psychological Society, Psychometric 
program, September 8, 1938. 
Is it not presumptuous to believe that, in almost no time at all, 
we can by modern methods of analysis, by the use of up-to-the-minute 
tests—mainly of the paper and pencil sort—and with the aid of 
ingenious tabulating and computing machinery, forthwith supplant 
systems of thought that have stood the pragmatic test of generations 
or centuries of use? 

Typologies, castes, distinctions in mental make-up that have stood 
the test of time are not grounded in mental factors of no importance. 
The first earmark of an unimportant factor is its Venus-like birth 
from the meditations of a Spranger, the libido of a Freud, the hunch of 
an enthusiastic employment officer, or the dial of a differential analyzer. 
The second earmark is also Venus-like—the factor stands in virgin 
purity, untrammeled by clothes of doubt, untouched by considerations 
of probable error, unqualified by tentative endorsement. The third 
earmark is almost a consequent of the other two—the factor has never 
been put to work, it has never served the needs of man in school, in 
business, in social adaptation. Truly none of these earmarks detract 
from the beauty of the picture and possibly this Venus will scrub floors 
as well as win acclaim for harmony of proportion and then, indeed, we 
shall be blessed. But initially let us call a factor with these earmarks 
just a Venus-factor and not load it down or trust it with heavy work. 

What is the heavy work that is to be done? We have eight million 
unemployed, another many millions feeling thwarted and believing 
that could they but find a channel of expression more fitted to their 
talents, their life would be richer and society of which they are a part 
would be the better. The heavy work resting squarely upon the 
shoulders of the typologist, of the mental factorist, of the character 
analyst, and upon the comprehensiveness, accuracy, and analytical 
nature of the mental measures that he uses, is to render aid in the 
mental and social adjustments necessary to alleviate the thwartings 
mentioned. 

This heavy work must not be encumbered with trivial mental 
factors. Let me suggest a Factor-of-no-importance that might be 
derived by the method of statistical analysis of test data. If tests of 
taste sensitivity to para-ethoxy-phenol-thio-carbamide were included 
in a series of tests to be factorized by matrix methods I have no doubt 
that a taste trait would evolve as an independent factor. Certainly 
it is real, psychologically real, perhaps genetically real, but still in 
comparison with those mental, sensory, and motor things that underlie 
the adequate adjustment of individuals in the society in which we live 
it certainly is a Factor-of-no-importance. I can, by a vigorous stretch 
of the imagination, conceive of a society in which such a factor would 
be important. I do not object to students spending their time upon 
such factors with the hope of discovering something of genetic or other 
importance, but I do insist that evidence of existence of a factor be not 
cited as evidence that it is important in the meeting of pressing guid- 
ance and social problems. 

Modern attempts to develop a psychology of types are aided by 
methods of measurement and of analysis not dreamed of by the classic 
typologists. This is all to the good, but it is not a substitute for the 
pragmatic test of usage. These modern methods should do four 
things: Provide us (a) with more and more precise means of measure- 
ment of human traits, (b) with the means of so combining measure- 
ments that we have simpler and more economical systems of thought, 
(c) with quantitative measures of stability of factors, and (d) with 
quantitative measures of serviceability of the traits employed. I 
believe modern methods have made notable strides in the first two 
connections, but that they have been so lacking in meeting the latter 
two needs as to cast discredit upon the entire movement. We need, 
but do not have, quantitative statements of stability of factors as age 
and group changes and of serviceability in scholastic and vocational 
connections. 

A hypothetical example which will be very elementary to one 
familiar with transformation of axes may make the problem clearer: 
Suppose we have two independent factors which are of prime impor- 
tance in a certain limited field of life—I am certain it would be a 
limited field if only two were called for. It is found by an experimental 
study that in this field independent factors A and B completely 
determine all the significant individual differences. A second study 
yields factors C and D, these being, let us say, rotations in the two 
dimensional field that concerns us of factors A and B. Thinking in 
terms of C and D is, under proper conditions of weighting and emphasis, 
fully equivalent to thinking in terms of A and B, but the fact that this 
is so establishes the instability of either set of factors. We at present 
lack quantitative statements of such instability, and the lack is 
crucial. This is not a mathematical problem to be resolved by showing 
the mathematical equivalence of transformations, or a philosophical 
problem to be solved by logical considerations of complete simplicity 
or uniqueness of configurations, but a quantitative problem of the 
relative utility of different systems of thought. 
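
The equivalence of the two descriptions in the hypothetical example can be illustrated numerically. What follows is only a minimal sketch in Python and is not part of the original address; the three pairs of factor scores and the rotation of 30 degrees are hypothetical values chosen for illustration. It shows that the distance between any two individuals is the same whether they are described by A and B or by the rotated factors C and D, which is the sense in which neither set is privileged by the mathematics alone.

```python
import math


def rotate(a, b, degrees):
    """Rotate the axes by `degrees`; return the scores on the new axes (C, D)."""
    theta = math.radians(degrees)
    c = a * math.cos(theta) + b * math.sin(theta)
    d = -a * math.sin(theta) + b * math.cos(theta)
    return c, d


def distance(p, q):
    """Euclidean distance between two individuals in a two-factor space."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical scores of three individuals on independent factors A and B.
people_ab = [(1.2, -0.4), (-0.7, 0.9), (0.3, 1.5)]

# The same individuals described by factors C and D (assumed 30-degree rotation).
people_cd = [rotate(a, b, 30.0) for a, b in people_ab]

for i in range(len(people_ab)):
    for j in range(i + 1, len(people_ab)):
        print(i, j,
              round(distance(people_ab[i], people_ab[j]), 6),
              round(distance(people_cd[i], people_cd[j]), 6))
```

Every individual difference expressible in one system is expressible in the other, so the choice between them has to rest on the measured stability and utility that the text calls for.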
I am not discouraged with the outlook, for several of the shortcomings 
mentioned are remediable by attainment along a single line. 
A typology based upon principles of parsimony and independence of 
factors promises to meet simultaneously several important needs. 
First, it in part (the other part being consequent to social demand) 
announces the direction in which our test construction endeavors 
should proceed. Second, it provides a wonderfully simple thought 
system. And third (and this I believe has not been noted before), 
correlative measures of stability seem attainable for such a system 
with a mathematical ease not present for other systems. Thus all 
of the main objectives cited, except that of functional importance in 
the society in which we live, seem to flow from a typology adopting 
this single standard of independence and parsimony of factors. 

I have not the necessary information which would permit me to 
characterize any of the factors found by recent mathematical analyses 
of data as being mental-factors-of-no-importance, but is it not neces- 
sary for us to assert that none have attached to them standard errors 
of stability, or quantitative measures of utility, and that they are 
accordingly of undetermined importance? Under this state of affairs, 
my fear is that many of the factors thus far “found” approach pretty 
close to the limit of no importance. 

As I see it, the chief evidence of further advance in the definition 
and measurement of mental factors will lie in incorporating into our 
procedures measures of utility, or of social needs to be met by the use 
of the analytical information provided by reliably measured mental 
factors. 








INCREASING RELIABILITY IN 
CONTROLLED EXPERIMENTS 


CHARLES C. PETERS 


The Pennsylvania State College 


THE PROBLEM 


As one reads the literature reporting findings from controlled 
experiments, one is struck by the frequency with which it is reported 
that “no statistically significant difference was found” between the 
experimental situation and the control. The implication is often left 
that the two procedures compared are of equal value in contributing 
toward the results measured. Such a finding gives, on the face of it, 
grounds for suspicion. There are probably few cases in which the 
alternative is really equal; certainly not so many of them as the litera- 
ture in question would suggest. The anomaly is due in part to a 
misinterpretation of statistical significance. If the researcher obtains 
a ratio of somewhat less than the conventionally demanded three, 
between the difference and its standard error, he implies or asserts 
in his interpretation that the alternative is probably equal in value, 
whereas the odds may be several hundred to one that it is superior, 
though somewhat less than the seven hundred forty to one that a ratio 
of three indicates. He has merely failed to prove with high con- 
clusiveness that the one is superior to the other. The other reason for 
so many negative outcomes is smallness of populations or lack of 
adequate measurement of outcomes. It is to these latter aspects 
that this paper is addressed. I shall point out certain features which 
would, if employed, increase reliability, and shall defend the allegations 
by some statistical considerations. 
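
The odds quoted above can be checked with a short calculation. The following is a minimal sketch in Python, offered on the assumption (not stated in so many words in the text) that the ratio is referred to the normal curve and that the odds are one-tailed; the function name is hypothetical.

```python
import math


def odds_of_superiority(ratio):
    """One-tailed odds, under the normal curve, that a true difference has
    the same sign as the obtained one, given difference / standard error."""
    # Upper-tail area of the normal curve beyond `ratio` standard errors.
    p_tail = 0.5 * math.erfc(ratio / math.sqrt(2.0))
    return (1.0 - p_tail) / p_tail


for ratio in (2.0, 2.5, 3.0):
    print(f"ratio {ratio:.1f}: about {odds_of_superiority(ratio):.0f} to 1")
```

On these assumptions a ratio of three gives odds close to the seven hundred forty to one cited, while a ratio of 2.5 still gives odds well above one hundred to one, which is hardly evidence that the two procedures are equal.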


EXTENDED MEASUREMENT 


The writer has been frequently astonished at the manner in which 
standard error ratios mount when a number of tests are administered 
during the course of an experiment and the scores on these tests 
summed. This is illustrated by the policy of teaching the experi- 
mental materials in units, giving comparable tests at the close of the 
several units, then summing each student’s scores on these into a 
composite for the year. In the experiment by Rice,¹ for example, 





¹ Rice, Ralph S.: “Extensive Reading versus Intensive Textbook Study as a 
Means of Acquiring a Knowledge of Scientific Facts and Principles.” Journal of 
Experimental Education, Vol. IV, June, 1936, pp. 376-402. 
where the standard error ratios for unit tests averaged about two or 
2.5, the ratios when the tests were summed ran up to from seven to 
twelve, or even higher. At first sight it seems incredible that reliabili- 
ties could be increased so markedly merely by summing subtests, but 
the following statistical considerations will show that such increase 
is not only possible but probable. 

The standard error of a difference between means in a matched-group 
experiment is given by the formula¹ 

$$\sigma_{m_1 - m_2} = \frac{\sigma_{d's}}{\sqrt{N}}$$

where $\sigma_{d's}$ is the standard deviation of the array of paired differences 
and N is the number of pairs of subjects in the experiment. The 
standard error ratio is 

$$t = \frac{\text{Dif.}}{\sigma_{\text{dif.}}} = \frac{\text{Dif.}\,\sqrt{N}}{\sigma_{d's}}$$


For a single subtest this formula requires the difference between 
the means on that particular trial and the standard deviation of the 
paired differences from that trial. For the summed tests the difference 
would be that between the means of the summed scores, and the $\sigma_{d's}$ 
would be the standard deviation of the array of differences between 
the summed scores. But the latter difference would be identically the 
same as the algebraic sum of the several differences, and the latter 
standard deviation identical with the standard deviation of the sum of 
individual paired differences from the several subtests. 

Now, we could increase the standard error ratio by increasing the 
numerator more rapidly than the denominator, and we shall show that 
just that may be expected to happen from summing subtests. 

We assume that there is some real difference in potency between 
the experimental set-up and the control set-up which operates to 
separate the means of the two groups. A part of any obtained difference 
is due to this factor and a part to the unreliability of our measurements. 
Assuming that the experimental factor remains equally potent 
throughout the experiment, the increment in the difference due to it 
will have the same sign in each trial and be equal in amount, so that 
when we sum $a$ subtests together the amount of separation due to it 




¹ Peters, C. C., and VanVoorhis: Statistical Procedures and Their Mathematical 
Bases. School of Education, Pennsylvania State College, 1935, page 143. 
will be $a$ times as great as in an individual test. But the contribution 
due to unreliability will fluctuate by chance as to sign and as to amount, 
and will tend to sum to zero in a plurality of subtests. If we call the 
difference from a single test $D_1$ and assume it to be the typical or 
average one, and call the difference from the summed tests $D_s$, then $D_s$ 
will equal $aD_1$. Thus the numerator of our fraction will vary directly 
as the number of tests summed together. 

But the denominator will mount less rapidly. The formula for 
the standard deviation of the sum of $a$ correlated arrays is¹ 

$$\sigma_s = \sigma_1\sqrt{a + (a^2 - a)\,r_{11}}$$

where $\sigma_1$ is the standard deviation of the paired differences on a single 
subtest and $r_{11}$ is the average intercorrelation among the subtests. 
If the arrays (which are our subtests) are perfectly intercorrelated, 
$\sigma_s$ would equal $a\sigma_1$, and our denominator would be increasing as rapidly 
as our numerator. But in practice they will not intercorrelate perfectly; 
the intercorrelations among successive sets of paired differences 
will be very low; perhaps, even slightly negative, certainly near zero. 
To the extent to which this average intercorrelation is low, the $\sigma$ of 
the denominator will have increased more slowly than the $D$ of the 
numerator, and our standard error ratio will have increased by reason 
of summing the subtests. If $r_{11}$ is zero, 

$$t_s = \frac{aD_1\sqrt{N}}{\sigma_1\sqrt{a}} = t_1\sqrt{a}.$$

The above considerations suggest the importance of extensive 
measurement in experimentation. Substantially the same effect 
would, of course, be obtained by end measurement by means of very 
long tests, except for the disturbing effect of fatigue. 
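
The growth of the ratio under summation can be worked out for particular values. The following minimal sketch in Python applies the formula for the standard deviation of a sum of correlated arrays; the single-test ratio of 2.5 and the sixteen unit tests are hypothetical figures chosen for illustration and are not taken from Rice's data, and the function names are likewise assumptions.

```python
import math


def sd_of_sum(sigma_1, a, r):
    """Standard deviation of the sum of `a` arrays, each with standard
    deviation `sigma_1` and average intercorrelation `r`."""
    return sigma_1 * math.sqrt(a + (a * a - a) * r)


def ratio_of_sum(t_1, a, r):
    """Standard error ratio for `a` summed subtests, given the typical
    single-subtest ratio `t_1`: the numerator grows `a` times, the
    denominator by sd_of_sum(1, a, r)."""
    return t_1 * a / sd_of_sum(1.0, a, r)


# Hypothetical case: a typical single-test ratio of 2.5 and 16 unit tests.
for r in (0.0, 0.1, 0.5, 1.0):
    print(f"r = {r:.1f}: summed ratio = {ratio_of_sum(2.5, 16, r):.2f}")
```

With zero intercorrelation the ratio rises by the square root of the number of subtests, and with perfect intercorrelation it does not rise at all; intermediate values fall between these limits.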


VALID MEASUREMENT 


The previous section was directed against the practice of using short 
and meager tests for measuring outcomes in experimentation; the 
proper separation of the means requires extensive testing. That is 
largely a matter of reliability. In this section we shall show the 
influence of validity of measurement in separating the means. Unfor- 
tunately it often happens in experimentation that commercial tests 
are employed to measure outcomes because they have prestige, or 
because they are readily available, when they are not very closely 
related to the difference the experimental factor could be expected to 


¹ Peters and VanVoorhis: Op. cit., p. 174. 
make. That is, they may be valid for other purposes but not particularly 
valid for measuring the outcomes of the particular experiment 
in question. We shall show that the difference between measured 
outcomes in experimental and control groups is likely to be attenuated 
by reason of this lack of validity in the testing instrument. 

Let c stand for the elements which a test measures that are affected 
by the experimental factor, let b stand for other identifiable elements 
validly measured as some kind of performance but not a performance 
affected by the experimental factor, and let e be chance factors caught 
in the measures. The test is, then, valid for its purpose to the extent 
to which it measures only the c factors. Each individual’s score will 
be made up of c + b + e factors. The difference between the means 
will be little affected by the b or the e factors, since they will tend to 
average about the same on the experimental and the control sides. 
But the variability will be affected by the b and the e factors as well 
as by the c factors; it will be increased by reason of them. 

If the test were perfectly valid the difference, measured in standard 
scores, would be $\frac{D}{\sigma_c}$, while with an invalid test it is 
$\frac{D}{\sigma_{(c+b+e)}}$. This latter is $\frac{\sigma_c}{\sigma_{(c+b+e)}}$ 
times as great as the former. But $\frac{\sigma_c}{\sigma_{(c+b+e)}}$ 
is the validity coefficient of the test: the correlation between its scores 
and perfectly valid scores of the same function. The proof of that 
is as follows: Consider our measures of c, b, and e to be in the form of 
deviations from the means of their respective arrays. Then 

$$r_{c(c+b+e)} = \frac{\sum c(c+b+e)}{N\sigma_c\,\sigma_{(c+b+e)}},$$

which is 

$$\frac{\sum c^2 + \sum cb + \sum ce}{N\sigma_c\,\sigma_{(c+b+e)}}.$$

But, since b and e are uncorrelated with c, $\sum cb$ and $\sum ce$ equal zero. 
Taking the N with the $\sum c^2$ of the numerator, we have 

$$r_{c(c+b+e)} = \frac{\sigma_c^2}{\sigma_c\,\sigma_{(c+b+e)}} = \frac{\sigma_c}{\sigma_{(c+b+e)}}.$$





If $D_1$ represents the difference if validly measured and $D$ the difference 
obtained by the somewhat invalid test, we have, from the last sentence 
preceding the proof, $D = rD_1$, or $D_1 = D/r$. Thus the difference 
would be separated if validly measured to an amount equal to the 
obtained difference divided by the validity coefficient of the test for 
measuring the particular function differentiating between the experimental 
and the control processes, when we are talking in terms of 
standard scores. This will operate to raise the standard error ratio, 
since the $\sigma$ of the denominator, which has been decreased by elimination 
of the irrelevant elements from the test, is the one which appears in 
the standard error formula. 
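
The identity between the validity coefficient and the ratio of the two standard deviations can be checked on artificial data. The following is a minimal sketch in Python, assuming normally distributed and mutually independent c, b, and e elements; the three standard deviations, the sample size, and the seed are hypothetical values chosen for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical standard deviations of the relevant (c), irrelevant but
# systematic (b), and chance (e) elements of a test score.
SD_C, SD_B, SD_E = 1.0, 1.5, 1.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
c = [rng.gauss(0.0, SD_C) for _ in range(n)]
b = [rng.gauss(0.0, SD_B) for _ in range(n)]
e = [rng.gauss(0.0, SD_E) for _ in range(n)]
score = [ci + bi + ei for ci, bi, ei in zip(c, b, e)]

# Validity coefficient from the formula sigma_c / sigma_(c+b+e) ...
r_theory = SD_C / math.sqrt(SD_C ** 2 + SD_B ** 2 + SD_E ** 2)
# ... and as the correlation actually observed in the artificial sample.
r_sample = pearson_r(c, score)

print(f"sigma_c / sigma_(c+b+e) = {r_theory:.3f}")
print(f"observed r              = {r_sample:.3f}")
print(f"a valid difference of 1.0 would be obtained as about {r_theory:.3f}")
```

Under these assumed values less than half of a validly measured difference, in standard terms, would survive in the obtained difference.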

The validity coefficient we are talking about here has nothing to 
do with the validity coefficients often published by test-makers, 
which are correlations with scores from other trusted tests measuring 
presumably the same function. We mean by the validity coefficient 
the coefficient of correlation between scores containing no factors other 
than relevant ones and corresponding scores containing some such 
factors—an r that could have, under ordinary testing conditions, only 
a theoretical meaning but about which it is, nevertheless, illuminating 
to speculate. 

Of course, in practice we could not ordinarily make this correction 
quantitatively, because we do not know the validity coefficient of the 
test for the purpose in hand. But this argument shows the danger 
involved in employing testing instruments in experimentation which 
have little pertinency to the experimental factor, and do little justice 
to it, then taking seriously the small and unreliable differences thus 
obtained. This very often happens in practice. It is not improbable 
that tests are sometimes employed for measuring outcomes in experi- 
mentation which are padded out by ninety per cent of irrelevant ele- 
ments while failing to include another ninety per cent of the outcomes 
actually influenced by the experimental factor. Under these con- 
ditions the real difference would be ten times as great (in standard 
terms) as the obtained one. 


REPLICATION OF EXPERIMENTS 


A third way to increase reliability is to repeat experiments and get 
a determination from the set of repeated trials rather than from a single 
trial. Ordinarily classroom conditions make feasible trials of an 
experimental factor with only small populations—the twenty-five 
or thirty pupils who can be matched with mates in two classrooms of 
the customary size. Since differences between experimental and con- 
trol outcomes are usually small, reliable determinations can not be 
obtained from such small populations. The experiment needs to be 
retried until satisfactorily stable results have been secured. If these 
retrials are made under different teachers and in different schools, the 
determination is more objective because the experimental factor is 
then freed from irrelevant elements involved in the personality of a 
single teacher or a single social setting. But, if the true difference is 
small, obtained differences in these repeated trials may be expected 
to fluctuate from trial to trial, not only in amount but also in sign. 
That fluctuation is not at all alarming and is no discredit to the experi- 
mental technique; it may be due merely to the fluctuations in sampling 
to be expected in all statistical research in the social sciences. In 
replicated experimentation the reliability of the difference is not that 
of each of the samples taken singly but that of the set taken as a whole. 
We shall show that this reliability is much higher than that of the 
average single sample. 

We assume an experimental factor running through all the trials 
equally potent to separate the contrasted groups in all trials, except 
for the effect of chance errors of sampling. Let $n_1, n_2, n_3, \ldots$ be the 
number of pairs of individuals in the several experiments. Similarly 
let $D_1, D_2, D_3, \ldots$ be the differences between means in samples, 
and let $t_1, t_2, t_3, \ldots$ be the ratios of the several differences to their 
standard errors. Then 

$$t_1 = \frac{D_1}{\sigma_{D_1}} \tag{1}$$

But 

$$\sigma_{D_1} = \frac{\sigma_{d_1}}{\sqrt{n_1}} \tag{2}$$

where $\sigma_{d_1}$ is the standard deviation of the array of paired differences 
in Experiment 1. Furthermore, $D_1 = M_{d_1}$, whence 

$$D_1 = \frac{\sum d_1}{n_1} \tag{3}$$

Thus, substituting (2) and (3) in (1), 

$$t_1 = \frac{\sum d_1\,\sqrt{n_1}}{n_1\,\sigma_{d_1}} \tag{4}$$

and, multiplying through by $\sqrt{n_1}$, $t_1\sqrt{n_1} = \sum d_1/\sigma_{d_1}$. 

Thus we have the paired differences in “standard (z) scores” taken 
from zero as origin. If now we use the symbol $z$ for these scores and 
sum for all the experiments, we have $\sum\sum z_d = \sum t\sqrt{n}$, it being thus 
indicated that each $t$ is to be weighted by the square root of its corresponding 
$n$. If we assume that these deviations divided by the 
standard deviations of their respective series are sufficiently similar 
to permit averaging them without distortion of meaning, the difference 
between the means of the summed experimental and control groups 
will be the same as the mean of all the paired differences; namely, 

$$D_z = \frac{\sum\sum z_d}{\sum n} = \frac{\sum t\sqrt{n}}{\sum n} \tag{5}$$

The $t$ for this set of combined $z$ scores will be, by analogy with (4), 

$$t_z = \frac{\sum t\sqrt{n}\cdot\sqrt{\sum n}}{\sigma_{z_d}\sum n}$$

where now the $\sigma_{z_d}$ is the standard deviation of the whole consolidated 
population of paired differences. If the $z_d$'s were taken as deviations 
from the true mean instead of from zero, and if the z-scores were 
standard scores from the total population instead of from the subgroups, 
this $\sigma_{z_d}$ would be 1, which is always the value of the standard 
deviation of a set of z-scores. This latter condition need not worry us, 
since the sigma of a standard error formula is properly that of the 
total population of samples rather than that of a single sample. The 
former limitation also is not serious, since it involves merely adding 
a constant to each score as compared with deviations from the true 
mean, which does not affect the standard deviation. Thus, while $\sigma_{z_d}$ 
is not precisely the standard deviation of a total assembly of standard 
scores, it is nearly so and may be taken roughly to equal 1. So we 
have 

$$t_z = \frac{\sum t\sqrt{n}\cdot\sqrt{\sum n}}{\sum n} = \frac{\sum t\sqrt{n}}{\sqrt{\sum n}} \tag{6}$$

This would lend itself to calculation if we had the several ratios 
and the several populations. But if we may assume substantially 
equal populations in the several samples, and that the standard deviation 
of the whole population of paired differences is substantially equal 
to 1 (which would be not far from true in spite of the fact that the 
deviations are taken from zero rather than from the true mean) our 
formula would greatly simplify, as follows: 

$$t_z = \frac{\sqrt{n}\,\sum t}{\sqrt{an}} = \frac{aM_t}{\sqrt{a}} = M_t\sqrt{a}$$

where $a$ is the number of samples and $M_t$ is the mean of the several ratios. 
Thus, if the samples are assumed to be equal in size of population, 
and equally potent in contributing to a validly measured difference, 
the ratio between the total mean difference and its standard error may 
be taken to be substantially the square root of the number of samples 
times the mean ratio. Thus from twenty-five determinations the 
standard error ratio may be expected to be five times as great as the 
average ratio from a single determination. 
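
The combination of ratios described in this section can be sketched in a few lines. The following minimal Python example weights each ratio by the square root of its number of pairs and divides by the square root of the total number of pairs, taking the standard deviation of the pooled standard scores as 1, as the text does. The function name and all the figures are hypothetical.

```python
import math


def combined_ratio(ratios, pair_counts):
    """Standard error ratio for a set of independent replications:
    sum of t * sqrt(n) divided by the square root of the total n."""
    weighted_sum = sum(t * math.sqrt(n) for t, n in zip(ratios, pair_counts))
    return weighted_sum / math.sqrt(sum(pair_counts))


# Hypothetical equal-sized case: 25 trials of 30 pairs, each with ratio 1.2.
print(combined_ratio([1.2] * 25, [30] * 25))

# Hypothetical unequal case, with one trial reversed in sign.
print(combined_ratio([2.1, -0.4, 1.3, 0.9], [25, 30, 28, 27]))
```

In the equal-sized case the combined ratio is five times the single-trial ratio, in agreement with the square-root rule; in the unequal case a single reversal of sign lowers the combined ratio but does not overturn the prevailing tendency.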

The formulae of this section, just as those of the preceding sections, 
are not offered as useful formulae for actually making a quantitative 
computation of the correct statistic; the assumptions which must be 
made are too precarious to make that safe, except as a rough estimate. 
The argument is intended merely to stress the fact that the reliability 
of a set of experiments with differences prevailingly in the same 
direction is much higher than that of the average single trial. But it 
must be noted that this applies only where the populations of the 
several experiments are independent of one another; it does not apply 
with full force to the case where the same pupils are re-measured, but 
only where there are added chance samples of the total population 
regarding which the generalization is to be stated. None of the sec- 
tions were intended to furnish techniques for making quantitative 
correction for poor experimental conditions, but to urge: (1) Thorough 
and extensive measurement of outcomes, especially by summing 
similar tests; (2) the use of tests optimally valid for measuring the 
difference due to the experimental set-up rather than ready-made tests 
selected for prestige or convenience; and (3) the repetition of small- 
scope experiments and the estimation of reliability from the prevailing 
tendency in the set as a whole rather than from a single trial. 


SEX DIFFERENCES IN VOCATIONAL INTERESTS 


F. H. FINCH 


University of Illinois 


AND 
M. E. ODOROFF 
Minnesota State Board of Control 


Carter and Strong¹ have, by comparison of the measured vocational 
interests of boys and girls of secondary-school age, afforded a 
basis for the tentative conclusion that the vocational interests of the 
two sexes show certain marked differences. The present study 
attempts to verify this conclusion, and also extends the comparison 
of the vocational interest scores of boys and girls to a younger group 
made up wholly of students in junior high school. 

Data were collected by the administration of the Strong Vocational 
Interest Blank for men to a total of four hundred sixty-seven individ- 
uals who were attending the University High School, University of 
Minnesota. There were one hundred six junior-high-school girls, 
twenty-three of whom were from the seventh grade, forty from the 
eighth, and forty-three from the ninth. Of the one hundred twenty- 
seven junior high-school boys to whom the blank was given, twenty- 
eight were from the seventh, forty-two from the eighth, and fifty-seven 
from the ninth grade. The senior-high-school group included one 
hundred twelve girls, seventy-two from grade XI and forty from grade 
XII; and one hundred twenty-two boys, of whom seventy-one were 
from grade XI and fifty-one from grade XII. 

That there was approximately three years’ difference in the aver- 
age ages of the junior and senior high-school groups will be seen from 
the following: 








Group                      |   N | Range, months |   M |   SD
Junior high-school girls   | 106 |    126-196    | 165 | 11.1
Junior high-school boys    | 127 |    128-197    | 166 | 11.9
Senior high-school girls   | 112 |    174-229    | 201 | 10.0
Senior high-school boys    | 122 |    169-237    | 201 |  8.4














It will be noted that the mean ages are somewhat below that 
typical of high-school students from the grade levels represented. 
152 The Journal of Educational Psychology 


This is only one evidence of the fact that the sample is not representa- 
tive of junior and senior high-school students in general, but that it is 
instead affected by a high degree of selection with respect to the 
educational and occupational status of the parents. The mean 


TaBLeE I.—Sex DIFFERENCES IN VOCATIONAL INTEREST SCORES 
Senior High Group (122 Boys, 112 Girls) 



































Boys’ scores | Girls’ scores Diff 
Diff. * SDpitr. ~) - 
Mean | SD | Mean | SD SDpitr. 
1. Advertiser......... —135.0)214.0 9.4/174.6|—144.4|) 25.5 5.66 
2. Architect..........]— 47.0)155.5 2.0)151.2}— 49.0} 20.1 2.44 
3. Artist.............|—250.0/225.0|)— 37.5|230.3)}—212.5) 31.7 6.70 
4. C.P.A.............|—103.5) 92.1]— 94.4) 78.8)-— 9.1) 11.2 0.81 
5. Chemist...........|— 11.5|177.5|—142.5)135.5| 131.0} 20.6 6.36 
6. Physician..........|— 53.4)152.1)— 77.3)130.5 23.9} 18.5 1.29 
| eee — 14,0/180.0) —221.6)144.6) 207.6) 21.3 9.75 
i ae 21.3/205.4) —120.5)133.0) 141.8) 22.5 | 6.30 
9. Journalist......... |— 4.4|193.6}— 5.0)156.0 0.6) 22.9 .02 
10. Lawyer............{— 61.9/132.0)— 32.2)102.6|— 29.7] 15.4 1.93 
11. Life insurance sales .|—102.6)159.8|— 38.0) 11.6|— 64.6) 17.9 3.61 
> —216.0)198.1)| —121.2/219.6|— 94.8) 27.4 3.46 
13. Personnel manager.}— 52.4/104.8)— 92.0)118.0| 39.6) 14.6 2.71 
14. Psychologist....... —115.2/191.6)| —149.0/141.0 33.8) 21.9 1.54 
15. Purchasing agent...|— 19.7|102.4;— 94.5) 81.1 74.8) 12.0 6.23 
16. Real estate sales....}— 39.0)156.1;— 4.0)124.2)— 35.0) 18.4 1.90 
ee Is in os aeeewe — 75.6)183.7|— 62.5)133.0)— 13.1) 17.5 75 
18. Vacuum cleaner 
eo —106.4)169.5) —125.0)155.0 18.6} 21.2 .88 
19. Y.M.C.A. secretary| —156.4/210.8} —121.0/236.0 — 35.4) 29.4 1.20 
20. Physicistf......... — 86.0/303.0 —252.8/215.1, 166.8) 34.4 4.85 





* Differences recorded are mean score of boys minus mean score of girls in all 
tables. 


+ On the key for physicist only one hundred twenty-one boys and one hundred 
nine girls were scored. 


intelligence represented in the sample is roughly one standard devia- 
tion above that of an average high-school student body. Among the 
groups being compared, however, these selective factors apparently 
have not produced any serious differences between the sexes with 
respect to intelligence, home background, age, and other similar 
factors.

The younger children, despite their superior intelligence and socio-
educational background, experienced some difficulty in filling out the
blank. To overcome this, it was found necessary to allow children
from junior high school more time than adults ordinarily require and
also to furnish them orally with additional instructions for certain
sections of the blank. Furthermore, there were some terms with
which many individuals were unfamiliar, or which referred to matters
with which their experience was too limited to permit the expression
of a like or dislike. There were, as a result of this, many items to
which a response of indifference was recorded.


TABLE II.—SEX DIFFERENCES IN VOCATIONAL INTEREST SCORES
Junior High Group (127 Boys, 106 Girls)

                            Boys’ scores     Girls’ scores
                            Mean      SD     Mean      SD    Diff.   SD(diff.)  Diff./SD(diff.)
 1. Advertiser            -123.7   202.5     42.2   167.3   -165.9     24.2         6.86
 2. Architect               -4.9   127.0     48.6   122.5    -53.5     16.4         3.26
 3. Artist                -119.2   243.8    103.3   195.8   -222.5     28.8         7.77
 4. C.P.A.                -128.8    92.5    -87.3    72.0    -36.5     10.8         3.38
 5. Chemist                -50.7   124.8   -141.1    96.0     90.4     14.5         6.23
 6. Physician               14.3   132.0     14.6   114.5     -0.3     16.2         0.18
 7. Engineer               -15.7   138.6   -151.9   105.5    136.2     16.0         8.51
 8. [name illegible]        58.5   130.5    -52.8   113.5    111.3     16.0         6.96
 9. Journalist             -65.4   192.0     81.2   151.0   -146.6     22.5         6.52
10. Lawyer                 -51.2   131.4     -2.5    97.3    -48.7     15.1         3.23
11. Life insurance sales   -76.2   120.3    -39.3   100.0    -36.9     14.4         2.56
12. Minister              -320.9   198.0   -286.2   205.8    -34.7     26.6         1.30
13. Personnel manager     -146.0   107.5   -208.7   116.4     62.7     14.8         4.24
14. Psychologist          -157.3   132.5   -167.2   114.4      9.9     16.2         0.61
15. Purchasing agent        -7.9    79.5    -98.8    65.3     90.9      9.5         9.57
16. Real estate sales      -26.9   132.5     11.8   125.5    -38.7     16.9         2.29
17. [name illegible]      -156.1   126.5   -170.7   134.0     14.6     17.2         0.84
18. Vacuum cleaner sales  -122.9   172.8   -177.8   161.6     54.9     21.9         2.51
19. Y.M.C.A. secretary    -287.8   238.0   -321.7   230.0     33.9     30.7         1.10
20. Physicist              -44.9   250.0   -132.7   197.3     87.8     29.3         3.00

All of the interest blanks were scored on twenty of the occupational 
keys developed by Strong for use with men. Junior and senior high- 
school groups were considered separately and, within each group, the 
differences in mean scores for the two sexes were determined for each
of the twenty occupations. The method used paralleled that employed
by Carter and Strong, thus facilitating comparison of results from the 
two studies. 
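
As a rough illustration of the computation behind the tabled ratios, the sketch below divides the difference in means by the standard deviation of that difference. The formula used for the SD of the difference (the square root of SD²/N summed over the two groups) is an assumption here, being the usual one for independent samples; the text does not state it, although it agrees with the tabled value for the Advertiser key in the senior group to within rounding. The function names are illustrative only.

```python
import math


def sd_of_difference(sd_a, n_a, sd_b, n_b):
    """SD of a difference between two independent means (assumed formula)."""
    return math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)


def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (difference, SD of difference, difference / SD of difference)."""
    diff = mean_a - mean_b  # boys minus girls, as in the tables
    sd_diff = sd_of_difference(sd_a, n_a, sd_b, n_b)
    return diff, sd_diff, diff / sd_diff


# Advertiser key, senior high group (Table I): 122 boys, 112 girls.
diff, sd_diff, ratio = critical_ratio(-135.0, 214.0, 122, 9.4, 174.6, 112)
print(f"Diff. = {diff:.1f}, SD(diff.) = {sd_diff:.1f}, ratio = {ratio:.2f}")
```

The sign of the ratio follows the sign of the difference; Tables I and II print the ratio without sign.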


TABLE III.—CRITICAL RATIOS OF SEX DIFFERENCES FOR RELATED GROUPS OF
OCCUPATIONS*

                                    Senior high data     Junior high data
                                    Boys’     Girls’     Boys’     Girls’
                                    mean      mean       mean      mean
                                    higher    higher     higher    higher
Group I.
    1. Engineer                      9.75                 8.51
    2. Chemist                       6.36                 6.23
    3. [name illegible]              6.30                 6.96
    4. Physicist                     4.85                 3.00
    5. Physician                     1.29                           0.18
    6. Psychologist                  1.54                 0.61
    7. Architect                               2.44                 3.26
    8. Artist                                  6.70                 7.77
Group II.
  A.
    1. Lawyer                                  1.93                 3.23
    2. Journalist                    0.02                           6.52
    3. Advertiser                              5.66                 6.86
  B.
    1. Real estate salesman                    1.90                 2.29
    2. Life insurance salesman                 3.61                 2.56
Group III.
  A.
    1. [name illegible]                        0.75       0.84
    2. Minister                                3.46                 1.30
  B.
    1. Personnel manager             2.71                 4.24
    2. Y.M.C.A. secretary                      1.20       1.10
Group IV.
    1. Purchasing agent              6.23                 9.57
    2. Vacuum cleaner salesman       0.88                 2.51
Group V.
    1. C.P.A.                                  0.81                 3.38

* The groupings are those proposed by Strong (1).


The mean scores of boys and of girls on each of the twenty occupa-
tional keys, together with the differences in these scores and the data
necessary for their evaluation, are recorded for the senior high-school
group in Table I. Similar data for the junior high-school group are
assembled in Table II.


TABLE IV.—COMPARISON OF SEX DIFFERENCES IN INTEREST SCORES FROM THREE
GROUPS

                                      Carter and       Senior high      Junior high
                                      Strong           school students  school students
                                      (100 boys,       (122 boys,       (121 boys,
                                      100 girls)       112 girls)       106 girls)
                                      Diff./SD(diff.)  Diff./SD(diff.)  Diff./SD(diff.)
 1. Engineer                               9.07             9.75             8.51
 2. [name illegible]                       8.98             6.30             6.96
 3. Chemist                                6.76             6.36             6.23
 4. Purchasing agent                       6.34             6.23             9.57
 5. Physicist                              5.27             4.85             3.00
 6. Physician                              2.16             1.29            -0.18
 7. Mathematician                          1.50
 8. Psychologist                           0.84             1.54             0.61
 9. Personnel manager                      0.70             2.71             4.24
10. Architect                             -0.52            -2.44            -3.26
11. Vacuum cleaner salesman               -0.53             0.88             2.51
12. Real estate salesman                  -0.85            -1.90            -2.29
13. [name illegible]                      -1.27
14. Lawyer                                -2.31            -1.93            -3.23
15. City school superintendent            -3.36
16. Y.M.C.A. secretary                    -3.42            -1.20             1.10
17. [name illegible]                      -3.45            -0.75             0.84
18. Artist                                -4.26            -6.70            -7.77
19. Journalist                            -4.36             0.02            -6.52
20. Advertiser                            -4.58            -5.66            -6.86
21. C.P.A.                                -4.62            -0.81            -3.38
22. Life insurance salesman               -4.74            -3.61            -2.56
23. Minister                              -5.31            -3.46            -1.30


From the last column of Table I it can be seen that among the
senior high-school group nine of the twenty occupational keys reveal
sex differences greater than would ordinarily be expected to occur
by chance. Likewise, it can be seen from Table II that among the
younger group there are twelve keys that discriminate between the
responses of boys and girls to an extent that suggests the existence of
true differences.
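
The counts of nine and twelve keys can be checked against the last columns of Tables I and II. The text does not say how large a ratio is taken to exceed chance; the sketch below assumes a cutoff of 3.0, which is an assumption rather than the authors' stated rule, though it yields the same counts. The variable and function names are hypothetical.

```python
# Absolute ratios (Diff. / SD of diff.) in key order 1-20, as printed in Tables I and II.
SENIOR_RATIOS = [5.66, 2.44, 6.70, 0.81, 6.36, 1.29, 9.75, 6.30, 0.02, 1.93,
                 3.61, 3.46, 2.71, 1.54, 6.23, 1.90, 0.75, 0.88, 1.20, 4.85]
JUNIOR_RATIOS = [6.86, 3.26, 7.77, 3.38, 6.23, 0.18, 8.51, 6.96, 6.52, 3.23,
                 2.56, 1.30, 4.24, 0.61, 9.57, 2.29, 0.84, 2.51, 1.10, 3.00]

ASSUMED_CUTOFF = 3.0  # assumption: the cutoff is not given in the text


def count_reliable(ratios, cutoff=ASSUMED_CUTOFF):
    """Count the keys whose absolute ratio reaches the cutoff."""
    return sum(1 for r in ratios if abs(r) >= cutoff)


print(count_reliable(SENIOR_RATIOS), count_reliable(JUNIOR_RATIOS))
```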

A rearrangement of the ratios of differences to their respective 
standard errors, presented in Table III, makes use of a classification 
of occupations published by Strong.¹ It is evident from this table
that the observed sex differences generally tend to fall into a pattern 
related to Strong’s grouping. The resemblance between the sex 
differences found for the older and younger groups is least for certain 
of the occupations which Strong classifies as Group III. 

Finally, Table IV was prepared to afford a basis for comparing the 
sex differences here observed with those reported earlier by Carter 
and Strong.² The similarity of the sex differences characterizing the
three groups of students is apparent from an inspection of the table. 
A rough measure of this resemblance is to be had in the rank-order 
correlations which have been computed as follows: 


Carter and Strong with senior-high-students................. .85 
Carter and Strong with junior-high-students................. .77 
Senior-high-students with junior-high-students............... .85 
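
A rank-order correlation of this kind can be sketched as follows, using the rank-difference formula 1 - 6Σd²/(n(n² - 1)). The sketch assumes that the signed ratios of Table IV were ranked, that only the twenty keys scored in both groups enter the calculation, and that no values are tied (true of these data); the text confirms none of these details. Under those assumptions the senior and junior columns give a coefficient that rounds to .85.

```python
# Signed ratios from Table IV for the twenty keys scored in both groups,
# listed in Table IV order (rows 7, 13 and 15 have no entries and are omitted).
SENIOR = [9.75, 6.30, 6.36, 6.23, 4.85, 1.29, 1.54, 2.71, -2.44, 0.88,
          -1.90, -1.93, -1.20, -0.75, -6.70, 0.02, -5.66, -0.81, -3.61, -3.46]
JUNIOR = [8.51, 6.96, 6.23, 9.57, 3.00, -0.18, 0.61, 4.24, -3.26, 2.51,
          -2.29, -3.23, 1.10, 0.84, -7.77, -6.52, -6.86, -3.38, -2.56, -1.30]


def ranks(values):
    """Rank values from 1 (largest) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_order_correlation(x, y):
    """Rank-difference coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


print(round(rank_order_correlation(SENIOR, JUNIOR), 2))
```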


The results of the present investigation may be considered as con- 
firming Carter and Strong’s conclusion regarding the existence of 
certain differences between the measured vocational interests of boys 
and girls of high-school age. Furthermore, the fact that the number 
of occupational keys for which clear-cut sex differences appear at the 
junior-high-school level is no fewer than the number of such differences 
occurring among the group from senior high school affords a basis for 
the tentative conclusion that the interests measured by the Strong 
blank are, among the type of children represented in the sample 
studied, well developed prior to age fourteen. If further application 
of the Strong blank demonstrates this to be true, it would seem desir- 
able to attempt the construction of a measuring instrument which 
avoids that blank’s difficulties of vocabulary and form, in order that 
growth of interests among much younger children may be studied
objectively.





¹ Strong, E. K. Jr.: “Classification of Occupations by Interests.” Personnel
Journal, Vol. xii, 1933, pp. 301-313.

² Carter, Harold D. and Strong, E. K. Jr.: “Sex Differences in Occupational
Interests of High-School Students.” Personnel Journal, Vol. xii, 1933, pp.
166-175.

BOOK REVIEWS


CARL E. SEASHORE. Psychology of Music. New York: McGraw-
Hill Book Co., Inc., 1938, pp. 408, figs. 89.


Carl E. Seashore’s newest opus, Psychology of Music, represents the 
fruits of forty years of research. It is intended as a textbook for 
beginner students of music in scientific observation and reasoning 
about the art, and acts as a sequel to the same author’s Psychology 
of Musical Talent published nearly twenty years ago. While science 
and psychology of music are the specific objectives of Seashore’s 
research, much light is thrown also upon vocational and educational 
guidance and training for skills. 

Psychology of Music opens with an analysis of the normal musical 
mind. Some of the musician’s special faculties, such as imagination, 
memory, performance, musical intelligence, and sensory capacities, 
are briefly but capably sketched. 

Two additional preliminary chapters deal with The Musical 
Medium and The Science of Music. Since the sole medium through 
which the musician works is the sound wave, the science of music 
traces the vibrations of the sounding body, such as a reed, or the vocal 
cords, or stretched strings, through the air as air waves, and through 
the tympanic membrane, the bony system, the oval window, the 
liquid of the inner ear, and the receiving mechanism of the nerve 
cells as physical vibrations of material bodies. By means of phono- 
photography sound waves are intercepted, recorded, and analyzed on 
scientifically accurate measuring machines, such as the oscillograph 
and the Henrici Harmonic Analyzer, the latter being pictured at the 
beginning of the book. The author observes that while the cold 
details of musical facts can be recorded and organized by the psy- 
chologist, validity and interpretation depend upon an intimate 
knowledge of music and feeling for it. 

Drawing freely from his own many works (by a unique coincidence
this being the fortieth of his forty years’ labors) and from over a
hundred publications of those whom he affectionately calls his
“comrades in research” at the State University of Iowa, besides a great
number of other authors and sources, the author lucidly sets forth
the highlights of twenty-five topics.

First, and certainly one of the best, is a chapter on A Musical Orna- 
ment, The Vibrato. A good vibrato is a pulsation of pitch, usually 
accompanied with synchronous pulsations of loudness and timbre,
of such an extent and rate as to give a pleasing flexibility, tenderness,
and richness of tone. Three musical performance scores are illus- 
trated—two of voice and one of violin—in which the vibrato is checked 
against even or true pitch (frequency), time (in tenths of seconds), 
and intensity (in decibels). Vibrato presence, pitch extent, rate, 
form, intensity, and other characteristics are enumerated and analyzed. 
Among the findings likely to surprise the average musician, no less 
than the layman, is the fact that the width of the pulsation of pitch 
averages about a semi-tone for the vocal vibrato; and that the variation 
of individual vibrato cycles from this average in acceptable vibrato 
may be from 0.1 to 1.5 of a tone in a given singer! However, the 
vibrato is always heard as of very much smaller extent than it is in the 
physical tone because of certain hearing illusions. The extent of 
the pitch vibrato for violin is about a quarter tone and is fairly con- 
stant and regular. Besides the author’s own research on the subject 
reference is made to the works of Harold Seashore and Small. 

Following Vibrato come chapters on Frequency, Intensity, Time, 
and Timbre, in which audiograms, some phonograph records from the 
Seashore Measures of Musical Talent for group measuring of the 
Senses of Intensity and Time, and a suggested decibel-dynamic 
scale are significantly represented. 

About eighty pages are next devoted to the subjects of Sonance, 
Consonance, Rhythm, Hearing in Music, Imagining in Music
(excellent excerpts from the letters of famous composers are quoted 
to bolster the case for the superiority of auditory imagery in musicians), 
Thinking in Music, and Nature of Musical Feeling. 

The Timbre of Band and Orchestral Instruments, and chapters on 
Violin, Piano, and Voice follow, of which the latter three are of especial 
excellence. The Physical Basis of Piano Touch and Tone, by Otto
Ortmann, Director of the Peabody Conservatory of Music, is cited
by the author as the best available book on the subject for musicians.
Additional contributions to the chapter on Piano come from the works
of White, Hart, Fuller, Lusby and Ghosh. It is found that the pianist
has under his control only two of the four factors in music; namely,
intensity and duration. Pitch and timbre are determined primarily 
by the composer and the instrument. Therefore, no amount of 
vibrating, rocking, or caressing of the key after it has once hit bottom 
can modify the action upon the string. Photograms made by means 
of a specially constructed piano camera (explained in detail in the 
text) illustrate numerous other important truths.

Seashore believes that the violin probably produces the most beautiful
tone of all instruments. A number of tone spectra are shown illus-
trating the richness and even distribution of the partials. Although 
violinists impart a vibrato to nearly all tones at the rate of 6.5 cycles 
per second, deviate over 60 per cent of the time from tempered scale 
pitches, and over 80 per cent of the time from note values (over- 
holding and under-holding), it is found that beauty lies in the artistic 
deviation from the precise and uniform in all attributes of tone. 

Pattern scores of songs and singers are analyzed for tone aspect, 
intensity, temporal aspect, rhythm, and timbre and sonance. Pre- 
viously in the book it is established that in vibrato there are no marked 
and consistent variations with the sex of the singer, the vowel quality, 
the musical mode, the pitch level, or the loudness of tone. The extent 
of the vibrato does not differentiate emotions expressed. Some of 
the works drawn from in the Voice and Violin chapters besides Sea- 
shore’s contributions are by Bartholomew, Cheslock, Easley, and 
Fletcher. 

Space permits merely the mention of other valuable chapters on 
Principles of Guidance in Music, Measures of Musical Talent, Analyses 
of Talent in a Music School and in the Public School. Further 
chapters are on the Inheritance of Musical Talent (the indication is 
that the inheritance of musical capacities seems to follow Mendelian 
principles), Primitive Music (musical anthropology through phono- 
photography), The Development of Musical Skills, and Musical 
Æsthetics.

It is only natural that in a book of such fullness, not all of the 
deductions are final and unqualified, and that some minor exceptions 
might be noted here and there. It is problematical to what extent
singing teachers will share the author’s optimism when he states
(pp. 42-43): “A talented student who has no vibrato may develop
it to a very satisfactory degree in just a few lessons.” His use of
musical terminology such as appears on p. 91, for instance, may be 
confusing: ‘‘Tempo rubato—depends on fine shadings in time to 
produce the desired modulation.’’ Effect instead of modulation, 
perhaps, would be better, since modulation is generally accepted by 
musicians to mean a process of tonality change. Nor do musicians 
think of vibrato as an ornament (chapter 4, A Musical Ornament, 
The Vibrato) in the same sense as they think of trills, turns, and 
grace notes as ornaments. It does not appear (p. 188) whether any timbre
differences exist between the cornet and the trumpet; the latter,
despite its equal popularity, is not mentioned at all.

An appendix which does not form an integral part of the text is
included, in which the author quotes from the December, 1937, 
Music Educators Journal his reply to Mursell’s theory in regard to 
methods of evaluating musical talent. Mursell’s contention appears 
to be that there is only one satisfactory method of finding out whether 
the Seashore tests really measure musical ability, and that is to ascer- 
tain whether persons rating high or low or medium on these tests 
also rate high and low and medium in what one may call “musical 
behavior”; i.e., sight-singing, playing the piano, getting through
courses in theory and applied music, and the like. Seashore’s main 
point in reply is that a good profile in his tests need not be in itself 
a guaranty of musical success, but it may furnish a good lead and may 
become a basis for encouragement. 

A large bibliography, and indices to authors, musicians, com-
positions, and subjects bring the book to a close.

In the reviewer’s opinion Seashore has produced a truly significant 
textbook on the psychology of music from the scientific standpoint. 
It is, perhaps, the most comprehensive survey of the field in one 
volume. Moreover, the writing is clear and direct, and it should, 
therefore, appeal also to the interested layman. Despite some
occasional misgivings by the author (he fears, for instance, that
his chapter on A Musical Ornament, The Vibrato, “will make heavy
reading,” and that the one on Voice “has undoubtedly proved to be a
severe assignment for study”), it is an easy and excellent book to
read.

LOUIS CHESLOCK.
Peabody Conservatory of Music.





